Buying and selling used phones and tablets used to happen on only a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money on a purchase. The used device market has plenty of other benefits as well. Used and refurbished devices can be sold with warranties and insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, offer attractive deals on refurbished devices. Maximizing device longevity through second-hand trade also reduces environmental impact by promoting reuse and cutting waste. The COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
# Importing necessary libraries
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display
from matplotlib.ticker import MaxNLocator
# this will help in making the Python code more structured automatically (good coding practice)
!pip install black[jupyter] --quiet
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression model using statsmodels
import statsmodels.api as sm
# to compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)
custom = {"axes.edgecolor": "purple", "grid.linestyle": "solid", "grid.color": "black"}
sns.set_style("dark", rc=custom)
pd.set_option("display.float_format", lambda x: "{:.2f}".format(x))
%matplotlib inline
# let colab access my google drive
from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# read the data
df = pd.read_csv("/content/drive/MyDrive/Python_Course/Project_3/used_device_data.csv")
!black /content/drive/MyDrive/Python_Course/Project_3/'SLF_Project_LearnerNotebook_FullCode.ipynb'
reformatted /content/drive/MyDrive/Python_Course/Project_3/SLF_Project_LearnerNotebook_FullCode.ipynb All done! ✨ 🍰 ✨ 1 file reformatted.
# returns the first 5 rows
df.head()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
# View a random sample
df.sample(n=10, random_state=1)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 866 | Others | Android | 15.24 | no | no | 8.00 | 2.00 | 16.00 | 4.00 | 3000.00 | 206.00 | 2014 | 632 | 4.04 | 5.19 |
| 957 | Celkon | Android | 10.16 | no | no | 3.15 | 0.30 | 512.00 | 0.25 | 1400.00 | 140.00 | 2013 | 637 | 2.80 | 3.88 |
| 280 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 329 | 4.37 | 4.49 |
| 2150 | Oppo | Android | 12.83 | yes | no | 13.00 | 16.00 | 64.00 | 4.00 | 3200.00 | 148.00 | 2017 | 648 | 4.68 | 5.64 |
| 93 | LG | Android | 15.29 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 3500.00 | 179.00 | 2019 | 216 | 4.52 | 5.30 |
| 1040 | Gionee | Android | 12.83 | yes | no | 13.00 | 8.00 | 32.00 | 4.00 | 3150.00 | 166.00 | 2016 | 970 | 4.65 | 5.63 |
| 3170 | ZTE | Others | 10.16 | no | no | 3.15 | 5.00 | 16.00 | 4.00 | 1400.00 | 125.00 | 2014 | 1007 | 3.76 | 4.24 |
| 2742 | Sony | Android | 12.70 | yes | no | 20.70 | 2.00 | 16.00 | 4.00 | 3000.00 | 170.00 | 2013 | 1060 | 4.42 | 5.80 |
| 102 | Meizu | Android | 15.29 | yes | no | NaN | 20.00 | 128.00 | 6.00 | 3600.00 | 165.00 | 2019 | 332 | 4.96 | 6.04 |
| 1195 | HTC | Android | 10.29 | no | no | 8.00 | 2.00 | 32.00 | 4.00 | 2000.00 | 146.00 | 2015 | 892 | 4.23 | 4.88 |
# Checking shape of the data (rows by columns)
df.shape
print("There are", df.shape[0], "rows and", df.shape[1], "columns in the data frame.")
print()
# Use info() to print a concise summary of the DataFrame
df.info()
(3454, 15)

There are 3454 rows and 15 columns in the data frame.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   object
 1   os                     3454 non-null   object
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object
 4   5g                     3454 non-null   object
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
Observations:
There are 3454 rows and 15 columns in the data frame.
brand_name, os, 4g, and 5g have a Dtype of object. These need to be converted to categorical variables.
screen_size, main_camera_mp, selfie_camera_mp, int_memory, ram, battery, weight, normalized_used_price, and normalized_new_price have a Dtype of float.
release_year and days_used have a Dtype of integer.
main_camera_mp, int_memory, ram, selfie_camera_mp, battery, and weight columns have missing values.
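Before imputing, it helps to quantify how much of each column is missing. A minimal sketch on a toy frame (hypothetical values, standing in for the real df):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the used-device data (hypothetical values)
toy = pd.DataFrame(
    {
        "main_camera_mp": [13.0, np.nan, 8.0, np.nan],
        "battery": [3000.0, 4000.0, np.nan, 2100.0],
    }
)

# Percentage of missing values per column, sorted worst-first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

On the real data the same one-liner shows main_camera_mp is by far the most affected column.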
# Copy data to avoid any changes to the original data
df2 = df.copy()
# converting "object" to "category" reduces the space required to store the dataframe
# converting 'brand_name', 'os', '4g', and '5g' into categorical data
for col in ["brand_name", "os", "4g", "5g"]:
df2[col] = df2[col].astype("category")
# Use info() to print a concise summary of the DataFrame
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   category
 1   os                     3454 non-null   category
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   category
 4   5g                     3454 non-null   category
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: category(4), float64(9), int64(2)
memory usage: 312.2 KB
Observations:
All variable types are now correct, and memory usage has dropped from 404.9+ KB to 312.2 KB.
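The savings come from pandas storing each distinct category label once, plus a small integer code per row. A quick sketch comparing the two dtypes (toy Series, not the project data):

```python
import pandas as pd

# A hypothetical low-cardinality string column, similar to brand_name
s_obj = pd.Series(["Samsung", "Honor", "Samsung", "Honor"] * 1000)
s_cat = s_obj.astype("category")

# deep=True counts the actual string storage, not just pointers
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

The category version is smaller by a wide margin; the fewer unique values a column has, the bigger the win.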
Exploratory Data Analysis
Data Preprocessing
# Show summary statistics for all columns (numeric and categorical)
df2.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.00 | NaN | NaN | NaN | 13.71 | 3.81 | 5.08 | 12.70 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3275.00 | NaN | NaN | NaN | 9.46 | 4.82 | 0.08 | 5.00 | 8.00 | 13.00 | 48.00 |
| selfie_camera_mp | 3452.00 | NaN | NaN | NaN | 6.55 | 6.97 | 0.00 | 2.00 | 5.00 | 8.00 | 32.00 |
| int_memory | 3450.00 | NaN | NaN | NaN | 54.57 | 84.97 | 0.01 | 16.00 | 32.00 | 64.00 | 1024.00 |
| ram | 3450.00 | NaN | NaN | NaN | 4.04 | 1.37 | 0.02 | 4.00 | 4.00 | 4.00 | 12.00 |
| battery | 3448.00 | NaN | NaN | NaN | 3133.40 | 1299.68 | 500.00 | 2100.00 | 3000.00 | 4000.00 | 9720.00 |
| weight | 3447.00 | NaN | NaN | NaN | 182.75 | 88.41 | 69.00 | 142.00 | 160.00 | 185.00 | 855.00 |
| release_year | 3454.00 | NaN | NaN | NaN | 2015.97 | 2.30 | 2013.00 | 2014.00 | 2015.50 | 2018.00 | 2020.00 |
| days_used | 3454.00 | NaN | NaN | NaN | 674.87 | 248.58 | 91.00 | 533.50 | 690.50 | 868.75 | 1094.00 |
| normalized_used_price | 3454.00 | NaN | NaN | NaN | 4.36 | 0.59 | 1.54 | 4.03 | 4.41 | 4.76 | 6.62 |
| normalized_new_price | 3454.00 | NaN | NaN | NaN | 5.23 | 0.68 | 2.90 | 4.79 | 5.25 | 5.67 | 7.85 |
Observations:
brand_name has 34 unique values
os has 4 unique values
4g and 5g have 2 unique values (yes, no)
release_year ranges from 2013 to 2020
days_used ranges from 91 to 1094 days
battery ranges from 500 to 9720
weight ranges from 69 to 855
ram ranges from 0.02 to 12
# Identifies columns that have missing information
df2.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
# Breakout of release_year in the data
df2["release_year"].value_counts()
release_year
2014    642
2013    570
2015    515
2019    446
2016    383
2018    322
2017    299
2020    277
Name: count, dtype: int64
# Breakout of os in the data
df2["os"].value_counts()
os
Android    3214
Others      137
Windows      67
iOS          36
Name: count, dtype: int64
There are 4 operating systems in the data.
Android accounts for the vast majority of records (3214 of 3454).
# Breakout of brand names in the data
df2["brand_name"].value_counts()
brand_name Others 502 Samsung 341 Huawei 251 LG 201 Lenovo 171 ZTE 140 Xiaomi 132 Oppo 129 Asus 122 Alcatel 121 Vivo 117 Micromax 117 Honor 116 HTC 110 Nokia 106 Motorola 106 Sony 86 Meizu 62 Gionee 56 Acer 51 XOLO 49 Panasonic 47 Realme 41 Apple 39 Lava 36 Celkon 33 Spice 30 Karbonn 29 Microsoft 22 OnePlus 22 Coolpad 22 BlackBerry 22 Google 15 Infinix 10 Name: count, dtype: int64
There are 34 different brands in the data.
# checking for unique values
df2.nunique()
brand_name                 34
os                          4
screen_size               142
4g                          2
5g                          2
main_camera_mp             41
selfie_camera_mp           37
int_memory                 15
ram                        12
battery                   324
weight                    555
release_year                8
days_used                 924
normalized_used_price    3094
normalized_new_price     2988
dtype: int64
Need to check whether os or brand_name is redundant, since each brand falls under a small set of operating systems; will test which one, if either, should be dropped.
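One way to test that redundancy question is a crosstab: if every brand appeared under exactly one os, brand_name would fully encode os. A sketch on toy rows (hypothetical counts, not the real data):

```python
import pandas as pd

# Toy frame: brands mostly map to one OS, but not always (hypothetical rows)
toy = pd.DataFrame(
    {
        "brand_name": ["Apple", "Apple", "Samsung", "Nokia", "Nokia"],
        "os": ["iOS", "iOS", "Android", "Android", "Windows"],
    }
)

ct = pd.crosstab(toy["brand_name"], toy["os"])
# Count how many distinct operating systems each brand appears under
multi_os = ct.gt(0).sum(axis=1)
print(multi_os[multi_os > 1])  # brands spanning more than one OS
```

If the real crosstab shows many brands spanning multiple operating systems, both columns carry information and neither is trivially droppable.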
Missing Values
Missing Main Camera Megapixels
# Explore missing values from Main camera pixels
missing_main_camera = df2[df2["main_camera_mp"].isnull()]
missing_main_camera.head()
print()
# What os have missing values
missing_main_camera["os"].value_counts()
print()
# What brand names have missing data
missing_main_camera["brand_name"].value_counts()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Infinix | Android | 17.32 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 6000.00 | 209.00 | 2020 | 245 | 4.28 | 4.60 |
| 60 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 64.00 | 4.00 | 5000.00 | 185.00 | 2020 | 173 | 4.36 | 4.71 |
| 61 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 256 | 4.18 | 4.51 |
| 62 | Infinix | Android | 15.39 | yes | no | NaN | 16.00 | 32.00 | 3.00 | 4000.00 | 178.00 | 2019 | 316 | 4.56 | 4.60 |
| 63 | Infinix | Android | 15.29 | yes | no | NaN | 16.00 | 32.00 | 2.00 | 4000.00 | 165.00 | 2019 | 468 | 4.42 | 4.87 |
os Android 179 Others 0 Windows 0 iOS 0 Name: count, dtype: int64
brand_name Realme 36 Xiaomi 23 Oppo 20 Motorola 18 OnePlus 17 Meizu 15 Vivo 14 Infinix 10 Sony 7 ZTE 4 Asus 4 BlackBerry 4 Coolpad 3 Lava 2 Panasonic 2 Honor 0 Apple 0 XOLO 0 Spice 0 Samsung 0 Celkon 0 Others 0 Gionee 0 Huawei 0 Nokia 0 Google 0 Microsoft 0 Micromax 0 Alcatel 0 Lenovo 0 LG 0 Karbonn 0 HTC 0 Acer 0 Name: count, dtype: int64
179 devices, all running Android, are missing main camera megapixel data.
The missing values are spread across 15 brands.
Missing Selfie Camera Megapixels
# Explore missing values from Selfie camera pixels
missing_selfie_camera = df2[df2["selfie_camera_mp"].isnull()]
missing_selfie_camera.head()
print()
# What os have missing values
missing_selfie_camera["os"].value_counts()
print()
# What brand names have missing data
missing_selfie_camera["brand_name"].value_counts()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1080 | Google | Android | 15.32 | yes | no | 12.20 | NaN | 64.00 | 4.00 | 3430.00 | 184.00 | 2018 | 475 | 5.57 | 6.87 |
| 1081 | Google | Android | 12.83 | yes | no | 12.20 | NaN | 64.00 | 4.00 | 2915.00 | 148.00 | 2018 | 424 | 4.46 | 6.75 |
os Android 2 Others 0 Windows 0 iOS 0 Name: count, dtype: int64
brand_name Google 2 Realme 0 Motorola 0 Nokia 0 OnePlus 0 Oppo 0 Others 0 Panasonic 0 Acer 0 Micromax 0 Samsung 0 Sony 0 Spice 0 Vivo 0 XOLO 0 Xiaomi 0 Microsoft 0 Meizu 0 Alcatel 0 Lenovo 0 Lava 0 LG 0 Karbonn 0 Infinix 0 Huawei 0 Honor 0 HTC 0 Gionee 0 Coolpad 0 Celkon 0 BlackBerry 0 Asus 0 Apple 0 ZTE 0 Name: count, dtype: int64
Only 2 devices, both Google phones running Android, are missing selfie camera megapixel data.
Missing Internal Memory
# Explore missing values from Internal Memory
missing_int_memory = df2[df2["int_memory"].isnull()]
missing_int_memory.head()
print()
# What os have missing values
missing_int_memory["os"].value_counts()
print()
# What brand names have missing data
missing_int_memory["brand_name"].value_counts()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | Nokia | Others | 5.18 | yes | no | 0.30 | 0.00 | NaN | 0.02 | 1200.00 | 86.50 | 2019 | 234 | 2.72 | 3.69 |
| 2035 | Nokia | Others | 5.18 | no | no | 5.00 | 0.00 | NaN | 0.03 | 1020.00 | 157.00 | 2019 | 501 | 2.34 | 3.42 |
| 2064 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | NaN | 0.02 | 1100.00 | 78.40 | 2015 | 559 | 2.59 | 3.38 |
| 2092 | Nokia | Others | 7.62 | no | no | 5.00 | 0.00 | NaN | 0.02 | 1010.00 | 100.00 | 2013 | 1043 | 3.54 | 4.37 |
os Others 4 Android 0 Windows 0 iOS 0 Name: count, dtype: int64
brand_name Nokia 4 Acer 0 Panasonic 0 Microsoft 0 Motorola 0 OnePlus 0 Oppo 0 Others 0 Realme 0 Alcatel 0 Samsung 0 Sony 0 Spice 0 Vivo 0 XOLO 0 Xiaomi 0 Micromax 0 Meizu 0 Lenovo 0 Lava 0 LG 0 Karbonn 0 Infinix 0 Huawei 0 Honor 0 HTC 0 Google 0 Gionee 0 Coolpad 0 Celkon 0 BlackBerry 0 Asus 0 Apple 0 ZTE 0 Name: count, dtype: int64
4 Nokia devices are missing internal memory (ROM, in GB) data.
Missing Ram
# Explore missing values from Ram
missing_ram = df2[df2["ram"].isnull()]
missing_ram.head()
print()
# What os have missing values
missing_ram["os"].value_counts()
print()
# What brand names have missing data
missing_ram["brand_name"].value_counts()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 114 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | NaN | 1020.00 | 91.30 | 2020 | 288 | 2.73 | 2.91 |
| 335 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.10 | NaN | 1200.00 | 88.20 | 2020 | 327 | 3.06 | 3.69 |
| 2059 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | NaN | NaN | 82.60 | 2016 | 1023 | 2.77 | 3.66 |
| 2090 | Nokia | Others | 7.62 | no | no | 5.00 | 0.00 | 0.06 | NaN | 1200.00 | 111.40 | 2013 | 1001 | 3.83 | 4.61 |
os Others 4 Android 0 Windows 0 iOS 0 Name: count, dtype: int64
brand_name Nokia 4 Acer 0 Panasonic 0 Microsoft 0 Motorola 0 OnePlus 0 Oppo 0 Others 0 Realme 0 Alcatel 0 Samsung 0 Sony 0 Spice 0 Vivo 0 XOLO 0 Xiaomi 0 Micromax 0 Meizu 0 Lenovo 0 Lava 0 LG 0 Karbonn 0 Infinix 0 Huawei 0 Honor 0 HTC 0 Google 0 Gionee 0 Coolpad 0 Celkon 0 BlackBerry 0 Asus 0 Apple 0 ZTE 0 Name: count, dtype: int64
4 Nokia devices are missing RAM data.
Missing Battery
# Explore missing values from Battery
missing_battery = df2[df2["battery"].isnull()]
missing_battery.head()
print()
# What os have missing values
missing_battery["os"].value_counts()
print()
# What brand names have missing data
missing_battery["brand_name"].value_counts()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1829 | Meizu | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 4.00 | NaN | 145.00 | 2014 | 986 | 4.18 | 4.86 |
| 1831 | Meizu | Android | 12.83 | yes | no | 20.70 | 5.00 | 16.00 | 4.00 | NaN | 158.00 | 2014 | 1043 | 4.88 | 5.99 |
| 1832 | Meizu | Android | 13.61 | yes | no | 20.70 | 2.00 | 16.00 | 4.00 | NaN | 147.00 | 2014 | 1007 | 4.74 | 5.83 |
| 1962 | Microsoft | Windows | 25.55 | no | no | 5.00 | 3.50 | 32.00 | 4.00 | NaN | 675.90 | 2013 | 931 | 5.23 | 5.80 |
| 2058 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 0.02 | NaN | 81.00 | 2016 | 815 | 2.72 | 3.37 |
os Android 3 Others 2 Windows 1 iOS 0 Name: count, dtype: int64
brand_name Meizu 3 Nokia 2 Microsoft 1 Panasonic 0 Motorola 0 OnePlus 0 Oppo 0 Others 0 Realme 0 Alcatel 0 Samsung 0 Sony 0 Spice 0 Vivo 0 XOLO 0 Xiaomi 0 Micromax 0 Acer 0 Lenovo 0 Lava 0 LG 0 Karbonn 0 Infinix 0 Huawei 0 Honor 0 HTC 0 Google 0 Gionee 0 Coolpad 0 Celkon 0 BlackBerry 0 Asus 0 Apple 0 ZTE 0 Name: count, dtype: int64
Meizu, Nokia, and Microsoft devices have missing battery data.
They span three different operating systems.
Missing Weight
# Explore missing values from Weight
missing_weight = df2[df2["weight"].isnull()]
missing_weight.head()
print()
# What os have missing values
missing_weight["os"].value_counts()
print()
# What brand names have missing data
missing_weight["brand_name"].value_counts()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3002 | XOLO | Android | 12.70 | yes | no | 13.00 | 5.00 | 32.00 | 4.00 | 2400.00 | NaN | 2015 | 576 | 4.17 | 4.93 |
| 3003 | XOLO | Android | 12.83 | yes | no | 8.00 | 5.00 | 16.00 | 4.00 | 3200.00 | NaN | 2015 | 800 | 4.28 | 5.19 |
| 3004 | XOLO | Android | 12.70 | no | no | 8.00 | 2.00 | 32.00 | 4.00 | 2100.00 | NaN | 2015 | 878 | 3.88 | 4.08 |
| 3005 | XOLO | Android | 10.29 | no | no | 5.00 | 0.30 | 32.00 | 4.00 | 1800.00 | NaN | 2015 | 1036 | 3.82 | 4.40 |
| 3006 | XOLO | Android | 12.70 | no | no | 5.00 | 0.30 | 16.00 | 4.00 | 2500.00 | NaN | 2015 | 679 | 3.84 | 4.35 |
os Android 6 Windows 1 Others 0 iOS 0 Name: count, dtype: int64
brand_name XOLO 7 Acer 0 Panasonic 0 Microsoft 0 Motorola 0 Nokia 0 OnePlus 0 Oppo 0 Others 0 Realme 0 Alcatel 0 Samsung 0 Sony 0 Spice 0 Vivo 0 Xiaomi 0 Micromax 0 Meizu 0 Lenovo 0 Lava 0 LG 0 Karbonn 0 Infinix 0 Huawei 0 Honor 0 HTC 0 Google 0 Gionee 0 Coolpad 0 Celkon 0 BlackBerry 0 Asus 0 Apple 0 ZTE 0 Name: count, dtype: int64
There are 7 XOLO devices missing weight data.
6 are listed with an Android os and 1 with a Windows os.
This should be checked to see whether that XOLO phone really runs Windows.
Observations:
Most of the missing data is on Android devices.
19 brands have missing data.
Need to decide whether os or brand_name is redundant; will test which one, if either, should be dropped.
Most of the records are Android devices, implying more Android phones are refurbished than any other os.
Update Main Camera Megapixels
# Fill missing main_camera_mp values with the median of each (os, brand_name) group
df2["main_camera_mp"] = df2["main_camera_mp"].fillna(
value=df2.groupby(["os", "brand_name"])["main_camera_mp"].transform("median")
)
df2
# The grouped median could not fill the Infinix rows: every Infinix record was
# missing main_camera_mp, so the group median itself was NaN
# Researched the main camera mp on an Infinix phone and it is 108 mp
# Replace only the still-missing Infinix values with 108
infinix_missing = (df2["brand_name"] == "Infinix") & (df2["main_camera_mp"].isnull())
df2.loc[infinix_missing, "main_camera_mp"] = 108
# See if there are missing values still
df2["main_camera_mp"].isnull().sum()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 8.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 2.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 2.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 |
3454 rows × 15 columns
0
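The fillna/transform pattern above generalizes to any column and grouping. A minimal sketch on a toy frame (hypothetical values) showing how transform("median") broadcasts each group's median back to its own rows:

```python
import numpy as np
import pandas as pd

# Toy frame: two hypothetical brands, each with one missing battery value
toy = pd.DataFrame(
    {
        "brand_name": ["A", "A", "A", "B", "B"],
        "battery": [3000.0, np.nan, 4000.0, 2000.0, np.nan],
    }
)

# transform("median") returns a Series aligned to toy's index, so fillna
# replaces each NaN with the median of that row's own brand
toy["battery"] = toy["battery"].fillna(
    toy.groupby("brand_name")["battery"].transform("median")
)
print(toy)
```

Brand A's NaN becomes 3500 (median of 3000 and 4000) and brand B's becomes 2000; a group with no non-missing values at all would stay NaN, which is exactly what happened with Infinix above.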
Update Selfie Camera Megapixels
# Fill missing selfie_camera_mp values with the median of each (os, brand_name) group
df2["selfie_camera_mp"] = df2["selfie_camera_mp"].fillna(
value=df2.groupby(["os", "brand_name"])["selfie_camera_mp"].transform("median")
)
df2
# See if there are still missing values
df2["selfie_camera_mp"].isnull().sum()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 8.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 2.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 2.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 |
3454 rows × 15 columns
0
Update Internal Memory
# Fill missing int_memory values with the median of each (os, brand_name, release_year) group
df2["int_memory"] = df2["int_memory"].fillna(
value=df2.groupby(["os", "brand_name", "release_year"])["int_memory"].transform(
"median"
)
)
df2
# See if there are still missing values
df2["int_memory"].isnull().sum()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 8.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 2.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 2.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 |
3454 rows × 15 columns
0
Update Ram
# Fill missing ram values with the median of each (os, brand_name, release_year) group
df2["ram"] = df2["ram"].fillna(
value=df2.groupby(["os", "brand_name", "release_year"])["ram"].transform("median")
)
df2
# See if there are still missing values
df2["ram"].isnull().sum()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 8.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 2.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 2.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 |
3454 rows × 15 columns
0
Update Battery
# Fill missing battery values with the median of each (os, brand_name) group
df2["battery"] = df2["battery"].fillna(
value=df2.groupby(["os", "brand_name"])["battery"].transform("median")
)
df2
# See if there are still missing values
df2["battery"].isnull().sum()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 8.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 2.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 2.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 |
3454 rows × 15 columns
0
Update Weight
# Fill missing weight values with the median of each (os, brand_name) group
df2["weight"] = df2["weight"].fillna(
value=df2.groupby(["os", "brand_name"])["weight"].transform("median")
)
df2
# See if there are still missing values
df2["weight"].isnull().sum()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 8.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 3.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 2.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 2.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 |
3454 rows × 15 columns
0
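The group-wise imputation used above for battery and weight relies on `groupby(...).transform("median")` returning a Series aligned to the original index, so `fillna` can match fill values row by row. A minimal sketch on toy data (all values hypothetical, not rows from the actual dataset):

```python
import pandas as pd

# Toy frame with one missing battery value (hypothetical data)
toy = pd.DataFrame(
    {
        "os": ["Android", "Android", "Android", "iOS"],
        "brand_name": ["A", "A", "A", "B"],
        "battery": [3000.0, 4000.0, None, 2000.0],
    }
)

# transform("median") keeps the original index, so each NaN is
# replaced with the median of its own (os, brand_name) group
toy["battery"] = toy["battery"].fillna(
    toy.groupby(["os", "brand_name"])["battery"].transform("median")
)
print(toy["battery"].tolist())  # [3000.0, 4000.0, 3500.0, 2000.0]
```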
# Check for missing data
df2.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
df2.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.00 | 13.71 | 3.81 | 5.08 | 12.70 | 12.83 | 15.34 | 30.71 |
| main_camera_mp | 3454.00 | 9.91 | 7.11 | 0.08 | 5.00 | 10.00 | 13.00 | 108.00 |
| selfie_camera_mp | 3454.00 | 6.56 | 6.97 | 0.00 | 2.00 | 5.00 | 8.00 | 32.00 |
| int_memory | 3454.00 | 54.51 | 84.94 | 0.01 | 16.00 | 32.00 | 64.00 | 1024.00 |
| ram | 3454.00 | 4.03 | 1.37 | 0.02 | 4.00 | 4.00 | 4.00 | 12.00 |
| battery | 3454.00 | 3132.11 | 1299.43 | 500.00 | 2100.00 | 3000.00 | 4000.00 | 9720.00 |
| weight | 3454.00 | 182.64 | 88.36 | 69.00 | 142.00 | 160.00 | 185.00 | 855.00 |
| release_year | 3454.00 | 2015.97 | 2.30 | 2013.00 | 2014.00 | 2015.50 | 2018.00 | 2020.00 |
| days_used | 3454.00 | 674.87 | 248.58 | 91.00 | 533.50 | 690.50 | 868.75 | 1094.00 |
| normalized_used_price | 3454.00 | 4.36 | 0.59 | 1.54 | 4.03 | 4.41 | 4.76 | 6.62 |
| normalized_new_price | 3454.00 | 5.23 | 0.68 | 2.90 | 4.79 | 5.25 | 5.67 | 7.85 |
df2.describe(include="category").T # describe for non-numeric values
| count | unique | top | freq | |
|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 |
| os | 3454 | 4 | Android | 3214 |
| 4g | 3454 | 2 | yes | 2335 |
| 5g | 3454 | 2 | no | 3302 |
Univariate Analysis:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(df2, feature, figsize=(12, 7), kde=False, bins=None):
"""
Creates a combined boxplot and histogram for a given feature in the dataset.
Args:
df2: The input dataframe.
feature (str): The column name for which to create the plot.
figsize (tuple, optional): Size of the figure (default: (12, 7)).
kde (bool, optional): Whether to show the density curve (default: False).
bins (int, optional): Number of bins for the histogram (default: None).
Returns:
None (displays the plot)
"""
fig, (ax_box, ax_hist) = plt.subplots(
nrows=2,
sharex=True,
figsize=figsize,
gridspec_kw={"height_ratios": (0.15, 0.85)},
)
# Boxplot
sns.boxplot(data=df2, x=feature, ax=ax_box, showmeans=True, color="#F72585")
# Histogram
if bins is None:
unique_values = df2[feature].unique()
bins = np.linspace(unique_values.min() - 1, unique_values.max() + 2, num=25)
sns.histplot(data=df2, x=feature, bins=bins, kde=kde, ax=ax_hist)  # honor the kde parameter
# Add mean and median lines
ax_hist.axvline(df2[feature].mean(), color="purple", linestyle="--", label="Mean")
ax_hist.axvline(df2[feature].median(), color="blue", linestyle="-", label="Median")
# Label each bar with its count
for p in ax_hist.patches:
ax_hist.annotate(
f"{int(p.get_height())}",
(p.get_x() + p.get_width() / 2.0, p.get_height()),
ha="center",
va="center",
xytext=(1, 10),
textcoords="offset points",
)
ax_hist.legend()
ax_hist.set_xlabel(feature)
ax_hist.set_ylabel("Frequency")
ax_hist.set_title(f"Frequency of {feature}")
plt.tight_layout()
Screen Size (cm)
# Plot the histogram and boxplot (histogram_boxplot creates its own figure)
histogram_boxplot(df2, "screen_size")
# Set the x-axis label
plt.xlabel("Size of the screen in cm")
df2["screen_size"].value_counts()
print()
df2["screen_size"].describe().T
screen_size
12.70 575
12.83 409
10.29 219
10.16 188
15.24 129
...
17.40 1
15.82 1
14.53 1
16.15 1
14.81 1
Name: count, Length: 142, dtype: int64
count 3454.00 mean 13.71 std 3.81 min 5.08 25% 12.70 50% 12.83 75% 15.34 max 30.71 Name: screen_size, dtype: float64
The distribution is right skewed.
~1321 devices have a screen size of about 13 cm.
There are several outliers.
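The skewness observations in this section can also be checked numerically: pandas exposes sample skewness via `Series.skew()`, where a positive value indicates a right (positive) skew. A small sketch on toy values (not the actual `screen_size` column):

```python
import pandas as pd

# Most values cluster low with one long right tail (hypothetical sizes)
sample = pd.Series([12.7, 12.7, 12.8, 12.8, 13.0, 15.3, 30.7])

# Positive skewness => right skewed; negative => left skewed
print(sample.skew() > 0)  # True
```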
Main Camera Megapixels
# Plot the histogram and boxplot
histogram_boxplot(df2, "main_camera_mp")
plt.xlabel("Mega Pixels of the Main Camera")
df2["main_camera_mp"].describe().T
count 3454.00 mean 9.91 std 7.11 min 0.08 25% 5.00 50% 10.00 75% 13.00 max 108.00 Name: main_camera_mp, dtype: float64
The mean is just under 10 MP.
There are some outliers.
The tallest bin sits around 15 MP, covering ~1374 devices.
The distribution appears right skewed.
Selfie Camera Megapixels
# Plot the histogram and boxplot
histogram_boxplot(df2, "selfie_camera_mp")
plt.xlabel("Mega Pixels of the Selfie Camera")
df2["selfie_camera_mp"].describe().T
count 3454.00 mean 6.56 std 6.97 min 0.00 25% 2.00 50% 5.00 75% 8.00 max 32.00 Name: selfie_camera_mp, dtype: float64
The mean is just under 7 MP.
There are multiple outliers.
The tallest bin sits around 5 MP, with a little fewer than 800 devices.
The distribution appears right skewed.
Internal Memory
# Plot the histogram and boxplot
histogram_boxplot(df2, "int_memory")
plt.xlabel("GB of Internal Memory")
df2["int_memory"].value_counts()
print()
df2["int_memory"].describe().T
int_memory
16.00      1283
32.00      1083
64.00       509
128.00      372
256.00       86
512.00       44
0.06         20
0.10         19
8.00         12
4.00         10
1024.00       8
0.50          4
0.20          2
24.00         1
0.01          1
Name: count, dtype: int64
count 3454.00 mean 54.51 std 84.94 min 0.01 25% 16.00 50% 32.00 75% 64.00 max 1024.00 Name: int_memory, dtype: float64
The distribution is heavily right skewed.
1283 devices have 16 GB of internal memory and 1083 devices have 32 GB.
There are multiple outliers.
Ram
# Plot the histogram and boxplot
histogram_boxplot(df2, "ram")
plt.xlabel("GB of Ram")
df2["ram"].value_counts()
print()
df2["ram"].describe().T
ram
4.00     2815
6.00      154
8.00      130
2.00       90
0.25       83
3.00       81
1.00       34
0.03       20
0.02       19
12.00      18
0.50        9
1.50        1
Name: count, dtype: int64
count 3454.00 mean 4.03 std 1.37 min 0.02 25% 4.00 50% 4.00 75% 4.00 max 12.00 Name: ram, dtype: float64
Most devices have 4 GB of RAM; 2815 devices share this value.
As a result, the less common values show up as outliers.
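The "outliers" noted throughout these plots follow the standard boxplot rule: points more than 1.5×IQR beyond the quartiles. A minimal sketch of that rule (toy values, not the actual `ram` column):

```python
import pandas as pd

# Toy sample dominated by one value, similar in shape to the ram column (hypothetical)
vals = pd.Series([0.25, 2, 4, 4, 4, 4, 4, 4, 6, 8, 12])

q1, q3 = vals.quantile(0.25), vals.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged as outliers by a boxplot
outliers = vals[(vals < lower) | (vals > upper)]
print(sorted(outliers.tolist()))  # [0.25, 2.0, 8.0, 12.0]
```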
Battery
# Plot the histogram and boxplot
histogram_boxplot(df2, "battery")
plt.xlabel("Energy Capacity of device in mAh")
df2["battery"].value_counts()
print()
df2["battery"].describe().T
battery
4000.00 341
3000.00 314
2000.00 244
2500.00 138
2100.00 121
...
6180.00 1
4180.00 1
2180.00 1
2880.00 1
3110.00 1
Name: count, Length: 324, dtype: int64
count 3454.00 mean 3132.11 std 1299.43 min 500.00 25% 2100.00 50% 3000.00 75% 4000.00 max 9720.00 Name: battery, dtype: float64
There are several outliers.
The distribution is right skewed.
~1002 devices have a battery capacity between 3000 and 4000 mAh.
Weight
# Plot the histogram and boxplot
histogram_boxplot(df2, "weight")
plt.xlabel("Weight of device in grams")
df2["weight"].value_counts()
print()
df2["weight"].describe().T
weight
150.00 112
140.00 86
160.00 80
155.00 68
145.00 68
...
157.50 1
372.00 1
340.20 1
319.80 1
240.00 1
Name: count, Length: 555, dtype: int64
count 3454.00 mean 182.64 std 88.36 min 69.00 25% 142.00 50% 160.00 75% 185.00 max 855.00 Name: weight, dtype: float64
There are several outliers.
The distribution is right skewed.
The largest share of devices weigh between 140 and 160 grams.
Days Used
# Plot the histogram and boxplot
histogram_boxplot(df2, "days_used")
plt.xlabel("Number of days the used/refurbished device has been used")
df2["days_used"].value_counts()
print()
df2["days_used"].describe().T
days_used
578 13
601 13
564 13
682 12
683 12
..
245 1
253 1
269 1
307 1
149 1
Name: count, Length: 924, dtype: int64
count 3454.00 mean 674.87 std 248.58 min 91.00 25% 533.50 50% 690.50 75% 868.75 max 1094.00 Name: days_used, dtype: float64
There are no outliers.
The distribution is left skewed.
Based on the chart, ~250 devices were used for ~600 days.
Normalized New Price
# Plot the histogram and boxplot
histogram_boxplot(df2, "normalized_new_price")
plt.xlabel("Normalized price of the new devices in Euros")
df2["normalized_new_price"].value_counts()
print()
df2["normalized_new_price"].describe().T
normalized_new_price
5.14 6
4.78 5
5.11 4
5.30 4
5.30 4
..
4.96 1
4.24 1
4.79 1
6.00 1
4.28 1
Name: count, Length: 2988, dtype: int64
count 3454.00 mean 5.23 std 0.68 min 2.90 25% 4.79 50% 5.25 75% 5.67 max 7.85 Name: normalized_new_price, dtype: float64
There are several outliers.
The distribution looks roughly normal.
~680 devices have a normalized new price of about 5.25 euros.
Normalized Used Price
# Plot the histogram and boxplot
histogram_boxplot(df2, "normalized_used_price")
plt.xlabel("Normalized price of the used devices in Euros")
print()
df2["normalized_used_price"].describe().T
count 3454.00 mean 4.36 std 0.59 min 1.54 25% 4.03 50% 4.41 75% 4.76 max 6.62 Name: normalized_used_price, dtype: float64
1. What does the distribution of normalized used device prices look like?
There are several outliers.
The distribution looks roughly normal.
The majority of devices have a normalized used price between 3.5 and 5.5 euros.
Categorical Variables
# function to create labeled barplots
def labeled_barplot(df2, feature, perc=False, n=None):
"""
Barplot with percentage at the top
df2: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df2[feature]) # length of the column
count = df2[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 10))
else:
plt.figure(figsize=(n + 1, 10))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=df2,
x=feature,
hue=feature,  # assigning hue to the x variable avoids the seaborn palette deprecation warning
palette="cubehelix",
legend=False,  # suppress the redundant legend
order=df2[feature].value_counts().index[:n].sort_values(),
)
# Annotate each bar with its count and percentage
for p in ax.patches:
prc = "{:.1f}%".format(100.0 * p.get_height() / total) # percentage
cnt = p.get_height() # count
xx = p.get_x() + p.get_width() / 2 # x coordinate of bar percentage label
yy = p.get_height() # y coordinate of bar percentage label
# Annotate percentage
ax.annotate(
prc,
(xx, yy),
ha="center",
va="center",
style="italic",
size=12,
xytext=(0, 10),
textcoords="offset points",
)
# Annotate count (adjust vertical position)
ax.annotate(
cnt,
(xx, yy + 5),
ha="center",
va="bottom", # Adjusted to display above the percentage label
size=12,
xytext=(0, 20),
textcoords="offset points",
)
# Increase y-axis size by 100
plt.ylim(0, ax.get_ylim()[1] + 100)
Brand Name
# Labeled barplot for brand name
labeled_barplot(df2, "brand_name", perc=True, n=25)
502 devices (14.5%) have their brand coded as Others.
The next highest count is Samsung, with 341 devices (9.9% of all devices).
Release Year
# Labeled barplot for release year
labeled_barplot(df2, "release_year", perc=True, n=25)
The most common release year is 2014: 642 devices (18.6%) were released that year.
The next most common is 2013, with 570 devices (16.5%).
Release years range from 2013 to 2020.
Operating System
# Labeled barplot for os
labeled_barplot(df2, "os", perc=True, n=25)
2. What percentage of the used device market is dominated by Android devices?
Most of the devices are Androids.
3214 devices (93.1%) are Androids.
4G
# Labeled barplot for 4g
labeled_barplot(df2, "4g", perc=True, n=25)
2335 devices (67.6%) have 4G.
1119 devices (32.4%) do not have 4G.
5G
# Labeled barplot for 5g
labeled_barplot(df2, "5g", perc=True, n=25)
Most devices do not have 5G.
Only 152 devices (4.4%) have 5G.
3302 devices (95.6%) do not have 5G.
Bivariate Analysis:
heatmap_list = df2.select_dtypes(include=np.number).columns.tolist()
heatmap_list.remove("release_year")  # dropping release_year as it is a temporal variable
plt.figure(figsize=(15, 7))
sns.heatmap(
df2[heatmap_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="cool"
)
plt.show()
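The matrix fed to `sns.heatmap` above comes from `DataFrame.corr()`, which computes pairwise Pearson correlations over the numeric columns. A toy sketch of its shape and symmetry (hypothetical column values, not the real dataset):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "screen_size": [12.7, 15.2, 17.3, 25.5],
        "weight": [140.0, 180.0, 210.0, 480.0],
        "battery": [2000.0, 3000.0, 4300.0, 7250.0],
    }
)

# Symmetric matrix with 1.0 on the diagonal; this is what sns.heatmap renders
corr = toy.corr()
print(corr.shape)  # (3, 3)
```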
Brand Name and Ram
# Create the bar plot
plt.figure(figsize=(25, 6))
sns.barplot(x="brand_name", y="ram", data=df2, hue="brand_name", legend=False)
plt.title("Rams per Brand")
plt.xlabel("Brand")
plt.ylabel("Rams")
plt.xticks(rotation=90)
plt.show()
# Create a groupby that includes brand name and the mean ram
average_ram_per_brand_name = df2.groupby("brand_name")["ram"].mean()
average_ram_per_brand_name.head()
print()
print()
plt.figure(figsize=(25, 6))
# specify colors
c = "red"
d = "blue"
e = "purple"
f = "teal"
# Create boxplot for brand name and ram
sns.boxplot(
data=df2,
y="ram",
x="brand_name",
notch=False,
patch_artist=True,
boxprops=dict(facecolor=c, color=c),
capprops=dict(color=d),
whiskerprops=dict(color=e),
flierprops=dict(color=f, markeredgecolor=f),
medianprops=dict(color=d),
)
plt.xticks(rotation=90)
brand_name Acer 3.90 Alcatel 3.41 Apple 4.00 Asus 4.05 BlackBerry 3.83 Name: ram, dtype: float64
3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
OnePlus devices have the most RAM on average.
Most brands average around 4 GB of RAM.
Celkon has the lowest average RAM at ~1.5 GB.
Weight and Brand Name
# Create the bar plot
plt.figure(figsize=(25, 6))
sns.barplot(x="brand_name", y="weight", data=df2, hue="brand_name", legend=False)
plt.title("Weight per Brand")
plt.xlabel("Brand")
plt.ylabel("Weight")
plt.xticks(rotation=90)
plt.show()
Battery and Brand Name
# Create the bar plot
plt.figure(figsize=(25, 6))
sns.barplot(x="brand_name", y="battery", data=df2, hue="brand_name", legend=False)
plt.title("Battery per Brand")
plt.xlabel("Brand")
plt.ylabel("Battery")
plt.xticks(rotation=90)
plt.show()
# Create a groupby that includes brand name and the mean weight
average_weight_per_brand_name = df2.groupby("brand_name")["weight"].mean()
average_weight_per_brand_name.head()
print()
print()
plt.figure(figsize=(25, 6))
# specify colors
c = "red"
d = "blue"
e = "purple"
f = "teal"
# Create boxplot for brand name and weight
sns.boxplot(
data=df2,
y="weight",
x="brand_name",
notch=False,
patch_artist=True,
boxprops=dict(facecolor=c, color=c),
capprops=dict(color=d),
whiskerprops=dict(color=e),
flierprops=dict(color=f, markeredgecolor=f),
medianprops=dict(color=d),
)
plt.xticks(rotation=90)
brand_name Acer 225.87 Alcatel 169.01 Apple 320.42 Asus 218.60 BlackBerry 160.57 Name: weight, dtype: float64
Apple devices weigh the most on average.
Most brands show outliers.
4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
df2.isnull().sum()
# Filter to devices with a battery larger than 4500 mAh;
# .copy() avoids the SettingWithCopyWarning on the assignments below
greater_than_4500_battery = df2.loc[df2["battery"] > 4500].copy()
# Convert brand_name to object and weight to int (just in case)
greater_than_4500_battery["brand_name"] = greater_than_4500_battery["brand_name"].astype("object")
greater_than_4500_battery["weight"] = greater_than_4500_battery["weight"].astype("int")
# Median weight per brand among the large-battery devices
result = greater_than_4500_battery.groupby("brand_name", dropna=False).agg({"weight": "median"})
result
print()
greater_than_4500_battery.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
| weight | |
|---|---|
| brand_name | |
| Acer | 155.00 |
| Alcatel | 149.00 |
| Apple | 299.00 |
| Asus | 170.00 |
| BlackBerry | 167.50 |
| Celkon | 140.00 |
| Coolpad | 155.00 |
| Gionee | 151.50 |
| Google | 167.00 |
| HTC | 155.00 |
| Honor | 171.75 |
| Huawei | 175.00 |
| Infinix | 185.00 |
| Karbonn | 126.50 |
| LG | 152.00 |
| Lava | 137.00 |
| Lenovo | 165.40 |
| Meizu | 159.50 |
| Micromax | 143.00 |
| Microsoft | 146.00 |
| Motorola | 166.00 |
| Nokia | 145.00 |
| OnePlus | 180.00 |
| Oppo | 169.50 |
| Others | 154.50 |
| Panasonic | 146.70 |
| Realme | 195.00 |
| Samsung | 165.00 |
| Sony | 154.50 |
| Spice | 158.00 |
| Vivo | 167.80 |
| XOLO | 132.00 |
| Xiaomi | 180.00 |
| ZTE | 157.50 |
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
Not all brands offer devices with a battery greater than 4500 mAh.
Apple has the highest median weight at 299 grams.
Karbonn has the lowest median weight at 126.50 grams.
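The filter-then-aggregate pattern intended for this question (note the `.copy()`, which keeps pandas' SettingWithCopyWarning away when the filtered frame is modified) can be sketched on toy data (hypothetical brands and values):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "brand_name": ["A", "A", "B", "B"],
        "battery": [5000, 4000, 6000, 4800],
        "weight": [200.0, 150.0, 300.0, 250.0],
    }
)

# .copy() makes the filtered frame independent of toy
big_battery = toy.loc[toy["battery"] > 4500].copy()

# Median weight per brand among large-battery devices only
median_weight = big_battery.groupby("brand_name")["weight"].median()
print(median_weight.to_dict())  # {'A': 200.0, 'B': 275.0}
```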
5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
screen_greater_than_6 = df2[
df2.screen_size > 6 * 2.54
]  # devices with a screen larger than 6 inches; multiplying by 2.54 converts inches to cm (6 in = 15.24 cm)
screen_greater_than_6.shape
(1099, 15)
screen_greater_than_6.brand_name.value_counts().sort_values(
ascending=False
)  # brand counts for screens larger than 6 inches, in descending order
brand_name Huawei 149 Samsung 119 Others 99 Vivo 80 Honor 72 Oppo 70 Lenovo 69 Xiaomi 69 LG 59 Motorola 42 Asus 41 Realme 40 Alcatel 26 Apple 24 Acer 19 Meizu 17 ZTE 17 OnePlus 16 Nokia 15 Sony 12 Infinix 10 Micromax 7 HTC 7 Google 4 XOLO 3 Gionee 3 Coolpad 3 Spice 2 Panasonic 2 Karbonn 2 Microsoft 1 Celkon 0 BlackBerry 0 Lava 0 Name: count, dtype: int64
labeled_barplot(
df2=screen_greater_than_6, feature="brand_name", perc=True, n=34)
# plotting screens bigger than 6 inches using barplot.
1099 devices have a screen size greater than 6 inches.
All but three brands offer screens larger than 6 inches; the exceptions are BlackBerry, Celkon, and Lava.
Most of these devices come from Huawei (149 devices) and Samsung (119 devices).
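The 6-inch cutoff uses the exact conversion 1 inch = 2.54 cm, so the threshold is 15.24 cm. A quick sketch of the threshold and the strict-greater-than filter (toy screen sizes, not the real column):

```python
import pandas as pd

threshold_cm = 6 * 2.54  # 6 inches expressed in cm
sizes = pd.Series([12.7, 15.0, 15.34, 17.3, 25.5])  # hypothetical screen sizes

# Strictly greater than 6 inches, matching the notebook's filter
large = sizes[sizes > threshold_cm]
print(len(large))  # 3
```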
Feature Engineering
# Create a device cost range
df2["device_cost_range"] = pd.cut(
x=df2.normalized_new_price,
bins=[-np.inf, 5.25, 6.00, np.inf],
labels=["Budget", "Mid-Range", "Premium"],
)
df2[
"device_cost_range"
].value_counts() # creating the device category based on the price values
device_cost_range Budget 1753 Mid-Range 1310 Premium 391 Name: count, dtype: int64
labeled_barplot(df2, "device_cost_range", perc=True) # plotting the device cost range
50.8% of devices have a normalized new price of 5.25 euros or less (Budget).
37.9% fall between 5.25 and 6.00 euros (Mid-Range).
11.3% are above 6.00 euros (Premium).
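`pd.cut` assigns each value to a labeled, right-closed interval, which is why a price of exactly 5.25 lands in Budget and exactly 6.00 in Mid-Range. A sketch with toy prices (hypothetical values):

```python
import numpy as np
import pandas as pd

prices = pd.Series([4.5, 5.25, 5.5, 6.0, 7.0])  # hypothetical normalized prices

tiers = pd.cut(
    prices,
    bins=[-np.inf, 5.25, 6.00, np.inf],  # intervals are right-closed by default
    labels=["Budget", "Mid-Range", "Premium"],
)
print(tiers.tolist())  # ['Budget', 'Budget', 'Mid-Range', 'Mid-Range', 'Premium']
```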
6. A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
selfie_camera_greater_than_8 = df2[
df2.selfie_camera_mp > 8
] # data set of >8 mp selfie cameras
selfie_camera_greater_than_8.shape
(655, 16)
selfie_camera_greater_than_8.brand_name.value_counts().sort_values(
ascending=False
) # printing the selfie cameras bigger than 8 mp in descending order
brand_name Huawei 87 Vivo 78 Oppo 75 Xiaomi 63 Samsung 57 Honor 41 Others 34 LG 32 Motorola 26 Meizu 24 ZTE 20 HTC 20 Realme 18 OnePlus 18 Sony 14 Lenovo 14 Nokia 10 Asus 6 Gionee 4 Infinix 4 Coolpad 3 Micromax 2 Panasonic 2 BlackBerry 2 Acer 1 Apple 0 XOLO 0 Spice 0 Google 0 Karbonn 0 Lava 0 Microsoft 0 Alcatel 0 Celkon 0 Name: count, dtype: int64
labeled_barplot(
df2=selfie_camera_greater_than_8, feature="brand_name", perc=True, n=34
) # plotting selfie cameras bigger than 8 MP using barplot.
plt.figure(figsize=(20, 10))
sns.countplot(
data=selfie_camera_greater_than_8, x="brand_name", hue="device_cost_range"
) ## countplot for 'brandname' with hue = 'device_cost_range' on selfie_camera_greater_than_8
plt.xticks(rotation=90)
plt.legend(loc=1)
plt.show()
# Calculate brand counts by range
brand_counts_by_range = (
selfie_camera_greater_than_8.groupby(["brand_name", "device_cost_range"])
.size()
.reset_index(name="counts")
)
brand_counts_by_range.head()
| brand_name | device_cost_range | counts | |
|---|---|---|---|
| 0 | Acer | Budget | 0 |
| 1 | Acer | Mid-Range | 1 |
| 2 | Acer | Premium | 0 |
| 3 | Alcatel | Budget | 0 |
| 4 | Alcatel | Mid-Range | 0 |
# Calculate brand counts
brand_counts = selfie_camera_greater_than_8["brand_name"].value_counts()
brand_counts
brand_name Huawei 87 Vivo 78 Oppo 75 Xiaomi 63 Samsung 57 Honor 41 Others 34 LG 32 Motorola 26 Meizu 24 ZTE 20 HTC 20 Realme 18 OnePlus 18 Sony 14 Lenovo 14 Nokia 10 Asus 6 Gionee 4 Infinix 4 Coolpad 3 Micromax 2 Panasonic 2 BlackBerry 2 Acer 1 Apple 0 XOLO 0 Spice 0 Google 0 Karbonn 0 Lava 0 Microsoft 0 Alcatel 0 Celkon 0 Name: count, dtype: int64
# Step 1: Calculate the total count for each 'brand_name'
total_counts = (
selfie_camera_greater_than_8.groupby("brand_name")
.size()
.reset_index(name="total_counts")
)
# Step 2: Group by both 'brand_name' and 'device_cost_range' and calculate counts
brand_counts_by_range = (
selfie_camera_greater_than_8.groupby(["brand_name", "device_cost_range"])
.size()
.reset_index(name="counts")
)
# Step 3: Merge the total counts with the grouped counts
brand_counts_by_range = brand_counts_by_range.merge(total_counts, on="brand_name")
# Calculate percentage
brand_counts_by_range["percentage"] = (
brand_counts_by_range["counts"] / brand_counts_by_range["total_counts"]
) * 100
brand_counts_by_range
| | brand_name | device_cost_range | counts | total_counts | percentage |
|---|---|---|---|---|---|
| 0 | Acer | Budget | 0 | 1 | 0.00 |
| 1 | Acer | Mid-Range | 1 | 1 | 100.00 |
| 2 | Acer | Premium | 0 | 1 | 0.00 |
| 3 | Alcatel | Budget | 0 | 0 | NaN |
| 4 | Alcatel | Mid-Range | 0 | 0 | NaN |
| ... | ... | ... | ... | ... | ... |
| 97 | Xiaomi | Mid-Range | 28 | 63 | 44.44 |
| 98 | Xiaomi | Premium | 20 | 63 | 31.75 |
| 99 | ZTE | Budget | 3 | 20 | 15.00 |
| 100 | ZTE | Mid-Range | 11 | 20 | 55.00 |
| 101 | ZTE | Premium | 6 | 20 | 30.00 |
102 rows × 5 columns
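The three-step groupby → merge → percentage pipeline above can be collapsed into a single `pd.crosstab` call with `normalize="index"`. A minimal sketch on a hypothetical mini-frame (the column names match the notebook's; the data is made up):

```python
import pandas as pd

# Hypothetical mini-frame standing in for selfie_camera_greater_than_8;
# only the two columns the pipeline needs.
df = pd.DataFrame({
    "brand_name": ["Acer", "ZTE", "ZTE", "ZTE", "ZTE"],
    "device_cost_range": ["Mid-Range", "Budget", "Mid-Range", "Mid-Range", "Premium"],
})

# Row-normalized crosstab: each brand's share per cost range in one call,
# replacing the separate groupby / merge / percentage steps.
pct = pd.crosstab(df["brand_name"], df["device_cost_range"], normalize="index") * 100
```

One practical difference: a brand with no devices in a range gets 0 in the crosstab, rather than the NaN (0/0) seen in the merged table above.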
# Get the top 10 brand/cost-range combinations and the top 10 brands overall
top_brands = brand_counts_by_range.nlargest(10, "counts")
total_brands = total_counts.nlargest(10, "total_counts")
# Filter the original dataframe to include only the top 10 brands
top_brands_df = selfie_camera_greater_than_8[
selfie_camera_greater_than_8["brand_name"].isin(top_brands["brand_name"])
]
total_brand_10_df = selfie_camera_greater_than_8[
selfie_camera_greater_than_8["brand_name"].isin(total_brands["brand_name"])
]
top_brands
print()
total_brands
| | brand_name | device_cost_range | counts | total_counts | percentage |
|---|---|---|---|---|---|
| 34 | Huawei | Mid-Range | 46 | 87 | 52.87 |
| 91 | Vivo | Mid-Range | 46 | 78 | 58.97 |
| 70 | Oppo | Mid-Range | 37 | 75 | 49.33 |
| 71 | Oppo | Premium | 36 | 75 | 48.00 |
| 35 | Huawei | Premium | 33 | 87 | 37.93 |
| 31 | Honor | Mid-Range | 30 | 41 | 73.17 |
| 97 | Xiaomi | Mid-Range | 28 | 63 | 44.44 |
| 73 | Others | Mid-Range | 26 | 34 | 76.47 |
| 82 | Samsung | Mid-Range | 26 | 57 | 45.61 |
| 92 | Vivo | Premium | 25 | 78 | 32.05 |
| | brand_name | total_counts |
|---|---|---|
| 11 | Huawei | 87 |
| 30 | Vivo | 78 |
| 23 | Oppo | 75 |
| 32 | Xiaomi | 63 |
| 27 | Samsung | 57 |
| 10 | Honor | 41 |
| 24 | Others | 34 |
| 14 | LG | 32 |
| 20 | Motorola | 26 |
| 17 | Meizu | 24 |
# Create the bar plot
plt.figure(figsize=(20, 15))
ax = sns.countplot(
data=total_brand_10_df,
x="brand_name",
hue="device_cost_range",
order=top_brands["brand_name"],
)
plt.xlabel("Top 10 brands with selfie cameras > 8MP")
plt.ylabel("Device Count")
plt.title("Top 10 Brands with Selfie Cameras > 8MP by Device Count")
plt.ylim(0, 50)  # add headroom above the bars for the count labels
# Iterate over each bar patch to add count labels
for p in ax.patches:
height = p.get_height()
ax.annotate(
format(height, ".0f"),
(p.get_x() + p.get_width() / 2.0, height),
ha="center",
va="center",
xytext=(0, 5), # Adjust this value
textcoords="offset points",
fontsize=10, # Customize the font size
)
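Since matplotlib 3.4, `Axes.bar_label` can replace the manual `ax.annotate` loop above. A minimal headless sketch with hypothetical counts:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
bars = ax.bar(["Huawei", "Vivo", "Oppo"], [87, 78, 75])  # hypothetical counts
# One call labels every bar; padding plays the role of the xytext offset above.
labels = ax.bar_label(bars, fmt="%d", padding=5)
plt.close(fig)
```

With a hued `sns.countplot`, the bars come in several containers, so iterate: `for c in ax.containers: ax.bar_label(c)`.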
655 devices have a selfie camera with more than 8 MP.
Huawei has the most such devices, 87 (13.3%); 46 of them are mid-range.
Vivo has the 2nd most, 78 (11.9%); 46 of them are mid-range.
Oppo has the 3rd most, 75 (11.5%); 37 are mid-range and 36 are premium.
7. Which attributes are highly correlated with the normalized price of a used device?
heatmap_list = df2.select_dtypes(include=np.number).columns.tolist()
# Calculate the correlation matrix
corr_matrix = df2[heatmap_list].corr()
# Create the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
corr_matrix, annot=True, vmin=-1, vmax=1, cmap="hsv", fmt=".2f", linewidths=0.05
)
plt.title("Correlation Heatmap")
plt.show()
normalized_used_price is .83 positively correlated with normalized_new_price: the higher the new price, the higher the used price tends to be.
normalized_used_price is .61 positively correlated with screen_size, selfie_camera_mp, and battery, a moderate association.
normalized_used_price is .52 positively correlated with ram, also a moderate association.
normalized_used_price is .40 positively correlated with main_camera_mp, a low-to-moderate association.
normalized_used_price is .38 positively correlated with weight, a low-to-moderate association.
normalized_used_price is .19 positively correlated with int_memory, the weakest association among the attributes.
normalized_used_price is -.36 correlated with days_used, a low-to-moderate negative association.
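Rather than reading values off the heatmap, the attributes can be ranked by their correlation with the target in one line. A sketch on synthetic data shaped like df2's numeric columns (the names are the notebook's, the values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
new_price = rng.normal(5, 1, 200)
used_price = 0.8 * new_price + rng.normal(0, 0.3, 200)   # strongly positive driver
days_used = -0.5 * used_price + rng.normal(0, 0.5, 200)  # negative driver

demo = pd.DataFrame({
    "normalized_new_price": new_price,
    "normalized_used_price": used_price,
    "days_used": days_used,
})

# Correlations with the target, strongest first (drop the self-correlation of 1.0).
ranked = (
    demo.corr()["normalized_used_price"]
    .drop("normalized_used_price")
    .sort_values(ascending=False)
)
```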
Outlier Check
Outlier Check on Screen size:
histogram_boxplot(df2, "screen_size") # histogram and boxplot for screen_size
df2[df2["screen_size"]>20].describe()
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 | 247.00 |
| mean | 23.30 | 6.89 | 3.50 | 49.20 | 3.91 | 5682.18 | 431.31 | 2015.81 | 682.99 | 4.82 | 5.61 |
| std | 2.91 | 3.23 | 3.28 | 111.09 | 1.07 | 1842.17 | 140.08 | 2.33 | 242.31 | 0.40 | 0.55 |
| min | 20.12 | 0.30 | 0.30 | 8.00 | 0.50 | 1200.00 | 118.00 | 2013.00 | 129.00 | 3.83 | 4.39 |
| 25% | 20.32 | 5.00 | 1.60 | 16.00 | 4.00 | 4435.00 | 329.00 | 2014.00 | 546.00 | 4.56 | 5.30 |
| 50% | 23.04 | 8.00 | 2.00 | 32.00 | 4.00 | 5200.00 | 439.00 | 2015.00 | 696.00 | 4.78 | 5.56 |
| 75% | 25.43 | 8.00 | 5.00 | 32.00 | 4.00 | 7040.00 | 508.00 | 2018.00 | 886.50 | 5.01 | 5.86 |
| max | 30.71 | 13.00 | 24.00 | 1024.00 | 12.00 | 9720.00 | 855.00 | 2020.00 | 1089.00 | 6.62 | 7.85 |
Outliers do not appear to be errors.
Outlier Check on main_camera_mp:
histogram_boxplot(df2, "main_camera_mp") # box_plot for main_camera_mp
df2[df2["main_camera_mp"]>40]
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Infinix | Android | 17.32 | yes | no | 108.00 | 8.00 | 32.00 | 2.00 | 6000.00 | 209.00 | 2020 | 245 | 4.28 | 4.60 | Budget |
| 60 | Infinix | Android | 15.39 | yes | no | 108.00 | 8.00 | 64.00 | 4.00 | 5000.00 | 185.00 | 2020 | 173 | 4.36 | 4.71 | Budget |
| 61 | Infinix | Android | 15.39 | yes | no | 108.00 | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 256 | 4.18 | 4.51 | Budget |
| 62 | Infinix | Android | 15.39 | yes | no | 108.00 | 16.00 | 32.00 | 3.00 | 4000.00 | 178.00 | 2019 | 316 | 4.56 | 4.60 | Budget |
| 63 | Infinix | Android | 15.29 | yes | no | 108.00 | 16.00 | 32.00 | 2.00 | 4000.00 | 165.00 | 2019 | 468 | 4.42 | 4.87 | Budget |
| 210 | ZTE | Android | 16.89 | yes | no | 48.00 | 16.00 | 128.00 | 8.00 | 5000.00 | 215.00 | 2019 | 336 | 5.67 | 6.71 | Premium |
| 215 | ZTE | Android | 16.89 | yes | no | 48.00 | 16.00 | 64.00 | 6.00 | 5000.00 | 215.00 | 2019 | 506 | 5.26 | 6.22 | Premium |
| 278 | Infinix | Android | 17.32 | yes | no | 108.00 | 8.00 | 32.00 | 2.00 | 6000.00 | 209.00 | 2020 | 320 | 4.41 | 4.61 | Budget |
| 279 | Infinix | Android | 15.39 | yes | no | 108.00 | 8.00 | 64.00 | 4.00 | 5000.00 | 185.00 | 2020 | 173 | 4.50 | 4.70 | Budget |
| 280 | Infinix | Android | 15.39 | yes | no | 108.00 | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 329 | 4.37 | 4.49 | Budget |
| 281 | Infinix | Android | 15.39 | yes | no | 108.00 | 16.00 | 32.00 | 3.00 | 4000.00 | 178.00 | 2019 | 356 | 4.42 | 4.61 | Budget |
| 282 | Infinix | Android | 15.29 | yes | no | 108.00 | 16.00 | 32.00 | 2.00 | 4000.00 | 165.00 | 2019 | 497 | 4.42 | 4.87 | Budget |
| 2097 | Nokia | Windows | 10.29 | yes | no | 41.00 | 1.20 | 32.00 | 4.00 | 2000.00 | 158.00 | 2013 | 988 | 5.11 | 5.99 | Mid-Range |
| 3240 | Motorola | Android | 15.34 | yes | no | 48.00 | 25.00 | 128.00 | 4.00 | 3600.00 | 165.00 | 2019 | 371 | 5.33 | 5.54 | Mid-Range |
| 3381 | Motorola | Android | 15.34 | yes | no | 48.00 | 25.00 | 128.00 | 4.00 | 3600.00 | 165.00 | 2019 | 422 | 5.33 | 5.54 | Mid-Range |
Outliers do not appear to be errors.
Outlier Check on selfie_camera_mp
histogram_boxplot(df2, "selfie_camera_mp") # box_plot for selfie_camera_mp
selfie_outliers=df2[df2["selfie_camera_mp"]>25]
selfie_outliers.sample(n=10)
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 73 | Lenovo | Android | 16.23 | yes | yes | 8.00 | 32.00 | 128.00 | 6.00 | 4000.00 | 185.00 | 2019 | 370 | 5.34 | 5.99 | Mid-Range |
| 3201 | Honor | Android | 15.77 | yes | no | 13.00 | 32.00 | 128.00 | 4.00 | 3400.00 | 164.00 | 2019 | 182 | 3.85 | 4.93 | Budget |
| 122 | Oppo | Android | 16.64 | yes | yes | 13.00 | 32.00 | 128.00 | 8.00 | 4000.00 | 172.00 | 2020 | 187 | 5.17 | 6.15 | Premium |
| 41 | Huawei | Android | 15.90 | yes | no | 13.00 | 32.00 | 64.00 | 6.00 | 4000.00 | 178.00 | 2019 | 206 | 5.00 | 5.29 | Mid-Range |
| 84 | LG | Android | 15.34 | yes | yes | 8.00 | 32.00 | 256.00 | 8.00 | 4000.00 | 192.00 | 2019 | 383 | 5.47 | 7.00 | Premium |
| 3391 | Oppo | Android | 15.37 | yes | yes | 13.00 | 32.00 | 256.00 | 12.00 | 4025.00 | 171.00 | 2020 | 348 | 5.63 | 6.23 | Premium |
| 3311 | Xiaomi | Android | 16.43 | yes | no | 12.00 | 32.00 | 256.00 | 8.00 | 5260.00 | 208.00 | 2019 | 355 | 5.53 | 6.33 | Premium |
| 128 | Oppo | Android | 15.34 | yes | yes | 8.00 | 32.00 | 128.00 | 8.00 | 4025.00 | 181.00 | 2019 | 397 | 4.87 | 6.09 | Premium |
| 3250 | Oppo | Android | 15.37 | yes | yes | 13.00 | 32.00 | 256.00 | 12.00 | 4025.00 | 171.00 | 2020 | 227 | 5.45 | 6.23 | Premium |
| 347 | Oppo | Android | 15.34 | yes | yes | 8.00 | 32.00 | 128.00 | 8.00 | 4025.00 | 181.00 | 2019 | 330 | 4.90 | 6.09 | Premium |
df2[df2["selfie_camera_mp"]>25].describe()
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 | 94.00 |
| mean | 15.86 | 11.63 | 32.00 | 133.45 | 6.64 | 4089.36 | 181.58 | 2019.37 | 295.05 | 5.16 | 5.87 |
| std | 0.49 | 2.01 | 0.00 | 47.54 | 1.84 | 458.18 | 13.13 | 0.49 | 105.76 | 0.32 | 0.47 |
| min | 15.27 | 8.00 | 32.00 | 64.00 | 4.00 | 3340.00 | 152.90 | 2019.00 | 91.00 | 3.85 | 4.93 |
| 25% | 15.37 | 10.50 | 32.00 | 128.00 | 6.00 | 4000.00 | 172.45 | 2019.00 | 213.50 | 5.03 | 5.60 |
| 50% | 15.90 | 13.00 | 32.00 | 128.00 | 6.00 | 4000.00 | 180.00 | 2019.00 | 282.50 | 5.18 | 5.88 |
| 75% | 16.23 | 13.00 | 32.00 | 128.00 | 8.00 | 4315.00 | 186.00 | 2020.00 | 367.25 | 5.36 | 6.09 |
| max | 16.94 | 13.00 | 32.00 | 256.00 | 12.00 | 6000.00 | 217.00 | 2020.00 | 520.00 | 5.79 | 7.00 |
These outliers might be errors: every one is exactly 32 MP while the corresponding main_camera_mp values are far lower.
They appear to be mistakes and should be dropped.
df2.drop(selfie_outliers.index, axis=0, inplace=True) # dropping the selfie_camera_mp outliers as the values are not meaningful
df2.shape
(3454, 16)
df2.selfie_camera_mp.describe() # checking "selfie_camera" again using describe()
count   3454.00
mean       6.56
std        6.97
min        0.00
25%        2.00
50%        5.00
75%        8.00
max       32.00
Name: selfie_camera_mp, dtype: float64
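As an aside, the same cleanup can be done with a boolean mask instead of `drop(index)`; a mask cannot go stale if cells are re-run out of order. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with a few implausible 32 MP selfie cameras.
df = pd.DataFrame({"selfie_camera_mp": [2.0, 5.0, 32.0, 8.0, 32.0]})

# Keep only the rows we trust; .copy() detaches the result from the original.
df_clean = df[df["selfie_camera_mp"] <= 25].copy()
```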
Outlier Check on int_memory
histogram_boxplot(df2, "int_memory") # box_plot for int_memory
df2[df2["int_memory"]>200].describe() # checking the values with >200 GB int_memory
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 | 138.00 |
| mean | 12.84 | 7.82 | 9.54 | 382.14 | 4.01 | 2957.80 | 171.48 | 2016.35 | 581.54 | 4.43 | 5.44 |
| std | 4.73 | 4.97 | 10.10 | 198.65 | 3.89 | 1484.08 | 65.05 | 2.92 | 269.93 | 1.23 | 1.39 |
| min | 5.18 | 0.30 | 0.30 | 256.00 | 0.25 | 1000.00 | 75.00 | 2013.00 | 112.00 | 2.06 | 2.95 |
| 25% | 7.75 | 2.29 | 0.30 | 256.00 | 0.25 | 1462.50 | 135.50 | 2013.00 | 356.50 | 3.13 | 4.10 |
| 50% | 15.24 | 8.00 | 8.00 | 256.00 | 4.00 | 3300.00 | 166.50 | 2018.00 | 571.00 | 4.94 | 5.99 |
| 75% | 16.23 | 13.00 | 16.00 | 512.00 | 8.00 | 4061.25 | 198.00 | 2019.00 | 827.75 | 5.47 | 6.59 |
| max | 30.71 | 16.00 | 32.00 | 1024.00 | 12.00 | 9720.00 | 631.00 | 2020.00 | 1056.00 | 6.62 | 7.85 |
df2[df2["int_memory"]>200].sample(n=10)
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2944 | Xiaomi | Android | 15.21 | yes | no | 13.00 | 5.00 | 256.00 | 4.00 | 3400.00 | 191.00 | 2018 | 590 | 5.19 | 5.99 | Mid-Range |
| 161 | Vivo | Android | 17.50 | yes | yes | 13.00 | 16.00 | 256.00 | 8.00 | 4500.00 | 219.50 | 2020 | 226 | 5.47 | 6.45 | Premium |
| 3334 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 256.00 | 8.00 | 4000.00 | 190.00 | 2020 | 266 | 5.56 | 6.45 | Premium |
| 2112 | OnePlus | Android | 16.28 | yes | no | 13.00 | 16.00 | 256.00 | 4.00 | 3700.00 | 185.00 | 2018 | 496 | 5.00 | 6.31 | Premium |
| 1631 | Lenovo | Android | 10.16 | no | no | 3.15 | 1.00 | 512.00 | 0.25 | 1500.00 | 134.00 | 2013 | 836 | 3.56 | 4.51 | Budget |
| 3166 | ZTE | Android | 10.16 | no | no | 3.15 | 5.00 | 512.00 | 0.25 | 1400.00 | 120.00 | 2014 | 1006 | 3.17 | 4.10 | Budget |
| 2251 | Others | Android | 12.70 | no | no | 5.00 | 0.30 | 1024.00 | 4.00 | 2200.00 | 170.00 | 2013 | 584 | 4.07 | 5.14 | Budget |
| 2932 | Xiaomi | Android | 15.70 | yes | no | 13.00 | 20.00 | 256.00 | 4.00 | 4000.00 | 182.00 | 2018 | 390 | 4.91 | 5.67 | Mid-Range |
| 1621 | Lenovo | Android | 7.75 | no | no | 2.00 | 1.00 | 512.00 | 0.25 | 1300.00 | 112.00 | 2013 | 987 | 3.18 | 4.24 | Budget |
| 1645 | LG | Android | 15.24 | yes | no | 13.00 | 5.00 | 256.00 | 4.00 | 3300.00 | 158.00 | 2018 | 366 | 5.40 | 6.48 | Premium |
Outliers do not appear to be errors.
Outlier Check on ram
histogram_boxplot(df2, "ram")
quartiles = np.quantile(df2['ram'][df2['ram'].notnull()], [.25, .75]) # specifying 1st and 3rd quartile limits
ram_4iqr = 4 * (quartiles[1] - quartiles[0])
print(f'Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {ram_4iqr}') # getting the values outside 4 times of IQR
outlier_quartiles = df2.loc[np.abs(df2['ram'] - df2['ram'].median()) > ram_4iqr, 'ram'] # printing those values
outlier_quartiles
Q1 = 4.0, Q3 = 4.0, 4*IQR = 0.0
0 3.00
1 8.00
2 8.00
3 6.00
4 3.00
...
3449 6.00
3450 8.00
3451 3.00
3452 2.00
3453 2.00
Name: ram, Length: 639, dtype: float64
Since Q1 = Q3 = 4 GB, the IQR is 0 and the 4×IQR rule flags all 639 non-median values, so the check is uninformative here. Dropping the ram column, since most devices have 4 GB.
# Dropping the column ram
df2.drop("ram", axis=1, inplace=True)
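The median ± 4×IQR rule used for ram can be wrapped in a small helper, which also makes the degenerate case explicit: when Q1 = Q3, the IQR is 0 and every off-median value is flagged. A sketch (the helper name is ours, not the notebook's):

```python
import numpy as np
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 4.0) -> pd.Series:
    """Values farther than k*IQR from the median (the rule used above)."""
    s = s.dropna()
    q1, q3 = np.quantile(s, [0.25, 0.75])
    return s[np.abs(s - s.median()) > k * (q3 - q1)]

# Degenerate case: Q1 == Q3 == 4, so IQR is 0 and any value != 4 is flagged.
ram = pd.Series([4, 4, 4, 4, 3, 8])
flagged = iqr_outliers(ram)
```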
Outlier Check on battery
histogram_boxplot(df2, "battery")
df2[df2["battery"]>6000].describe() # checking the details of batteries with more than 6000 mAh
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 110.00 | 110.00 | 110.00 | 110.00 | 110.00 | 110.00 | 110.00 | 110.00 | 110.00 | 110.00 |
| mean | 24.49 | 8.42 | 4.25 | 60.80 | 7467.80 | 488.70 | 2016.18 | 655.05 | 4.96 | 5.78 |
| std | 3.54 | 3.49 | 2.80 | 136.21 | 993.13 | 139.49 | 2.41 | 253.31 | 0.32 | 0.43 |
| min | 12.83 | 2.00 | 0.30 | 16.00 | 6020.00 | 130.40 | 2013.00 | 139.00 | 4.34 | 5.07 |
| 25% | 25.41 | 5.00 | 2.00 | 16.00 | 6800.00 | 452.50 | 2014.00 | 468.25 | 4.74 | 5.52 |
| 50% | 25.43 | 8.00 | 5.00 | 32.00 | 7250.00 | 488.00 | 2016.00 | 645.50 | 4.96 | 5.70 |
| 75% | 25.50 | 12.00 | 8.00 | 64.00 | 8000.00 | 575.00 | 2018.75 | 891.25 | 5.13 | 5.99 |
| max | 30.71 | 16.00 | 8.00 | 1024.00 | 9720.00 | 855.00 | 2020.00 | 1089.00 | 6.16 | 7.05 |
df2[df2["battery"]>6000].sample(n=10)
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1031 | Gionee | Android | 15.24 | yes | no | 12.00 | 8.00 | 64.00 | 6020.00 | 215.00 | 2017 | 623 | 4.99 | 6.13 | Premium |
| 2371 | Samsung | Android | 25.53 | yes | no | 13.00 | 8.00 | 128.00 | 7040.00 | 400.00 | 2019 | 252 | 5.36 | 6.04 | Premium |
| 51 | Huawei | Android | 20.42 | yes | no | 13.00 | 8.00 | 64.00 | 6100.00 | 320.00 | 2019 | 462 | 5.00 | 5.56 | Mid-Range |
| 33 | Huawei | Android | 25.50 | yes | no | 8.00 | 8.00 | 64.00 | 7250.00 | 450.00 | 2020 | 211 | 5.13 | 5.52 | Mid-Range |
| 1035 | Gionee | Android | 12.88 | yes | no | 13.00 | 8.00 | 128.00 | 7000.00 | 238.00 | 2016 | 656 | 5.67 | 7.05 | Premium |
| 664 | Apple | iOS | 23.04 | yes | no | 8.00 | 1.20 | 16.00 | 7340.00 | 437.00 | 2014 | 914 | 4.59 | 6.09 | Premium |
| 2546 | Samsung | Android | 25.53 | yes | no | 8.00 | 2.10 | 16.00 | 7900.00 | 467.00 | 2014 | 723 | 4.63 | 6.09 | Premium |
| 2585 | Samsung | Android | 30.53 | no | no | 8.00 | 2.00 | 32.00 | 9500.00 | 732.00 | 2014 | 799 | 5.33 | 6.15 | Premium |
| 2588 | Samsung | Android | 25.43 | no | no | 8.00 | 2.00 | 16.00 | 8220.00 | 469.00 | 2014 | 920 | 4.74 | 5.89 | Mid-Range |
| 1545 | Lenovo | Android | 20.32 | no | no | 8.00 | 5.00 | 16.00 | 6200.00 | 467.20 | 2015 | 770 | 4.72 | 5.07 | Budget |
Outliers do not appear to be errors; they are in line with the devices' screen size and weight.
Outlier Check on weight
histogram_boxplot(df2, "weight")
df2[df2["weight"]>300] # checking the details of devices weighs >300 grams
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 | Mid-Range |
| 21 | Others | Android | 20.32 | no | no | 8.00 | 0.30 | 16.00 | 5680.00 | 453.60 | 2013 | 933 | 4.45 | 5.48 | Mid-Range |
| 31 | Huawei | Android | 20.32 | yes | no | 5.00 | 2.00 | 16.00 | 5100.00 | 310.00 | 2020 | 231 | 4.17 | 4.60 | Budget |
| 33 | Huawei | Android | 25.50 | yes | no | 8.00 | 8.00 | 64.00 | 7250.00 | 450.00 | 2020 | 211 | 5.13 | 5.52 | Mid-Range |
| 36 | Huawei | Android | 25.60 | yes | yes | 13.00 | 8.00 | 128.00 | 7250.00 | 460.00 | 2020 | 139 | 5.18 | 6.31 | Premium |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2996 | Xiaomi | Android | 18.01 | no | no | 8.00 | 5.00 | 16.00 | 6700.00 | 360.00 | 2014 | 615 | 4.46 | 5.49 | Mid-Range |
| 3030 | XOLO | Android | 17.78 | no | no | 5.00 | 0.30 | 16.00 | 4100.00 | 320.00 | 2013 | 686 | 4.21 | 5.24 | Budget |
| 3039 | XOLO | Android | 17.78 | no | no | 2.00 | 0.65 | 32.00 | 4000.00 | 338.00 | 2013 | 1039 | 4.27 | 4.69 | Budget |
| 3040 | XOLO | Android | 20.32 | no | no | 2.00 | 0.30 | 16.00 | 4000.00 | 480.00 | 2013 | 1004 | 4.38 | 4.78 | Budget |
| 3051 | Others | Android | 12.83 | yes | no | 16.00 | 8.00 | 32.00 | 9000.00 | 469.00 | 2018 | 621 | 4.94 | 5.52 | Mid-Range |
287 rows × 15 columns
Weight is right-skewed. Applying a log transform to bring the distribution closer to normal.
df2["weight_log"] = np.log(df2["weight"]) # log-transform the weight column
histogram_boxplot(df2, "weight_log", bins=25)
plt.xlabel("Weight of Phone in Grams")
df2.weight_log.describe()
count   3454.00
mean       5.14
std        0.34
min        4.23
25%        4.96
50%        5.08
75%        5.22
max        6.75
Name: weight_log, dtype: float64
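The effect of the log transform can be quantified with sample skewness rather than judged from the histogram alone. A sketch on synthetic right-skewed weights (lognormal, parameters chosen only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
weight = pd.Series(rng.lognormal(mean=5.1, sigma=0.35, size=1000))  # grams, synthetic

raw_skew = weight.skew()          # clearly positive: long right tail
log_skew = np.log(weight).skew()  # near zero after the transform
```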
df2[df2["weight"]>6] # every device weighs more than 6 grams, so this returns the full dataframe
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range | weight_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3020.00 | 146.00 | 2020 | 127 | 4.31 | 4.72 | Budget | 4.98 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 | Mid-Range | 5.36 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 | Mid-Range | 5.36 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 | Mid-Range | 6.17 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 5000.00 | 185.00 | 2020 | 293 | 4.39 | 4.95 | Budget | 5.22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 | Premium | 5.25 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 | Premium | 5.30 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 4000.00 | 165.00 | 2020 | 201 | 4.36 | 4.53 | Budget | 5.11 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.00 | 5.00 | 32.00 | 4000.00 | 160.00 | 2020 | 149 | 4.35 | 4.62 | Budget | 5.08 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 4000.00 | 168.00 | 2020 | 176 | 4.13 | 4.28 | Budget | 5.12 |
3454 rows × 16 columns
Outliers do not appear to be errors; they are in line with the devices' screen size and battery.
Outlier Check on days_used
histogram_boxplot(df2, "days_used")
There are no outliers in days_used.
Outlier Check on normalized_new_price
histogram_boxplot(df2, "normalized_new_price")
df2[df2["normalized_new_price"]>6] # checking the devices values >6
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range | weight_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | Honor | Android | 16.69 | yes | yes | 13.00 | 16.00 | 128.00 | 4100.00 | 206.00 | 2019 | 537 | 5.38 | 6.21 | Premium | 5.33 |
| 25 | HTC | Android | 15.24 | yes | no | 13.00 | 14.50 | 128.00 | 3930.00 | 180.00 | 2019 | 202 | 5.21 | 6.04 | Premium | 5.19 |
| 34 | Huawei | Android | 16.71 | yes | yes | 10.50 | 16.00 | 256.00 | 4200.00 | 226.00 | 2020 | 120 | 6.15 | 7.24 | Premium | 5.42 |
| 36 | Huawei | Android | 25.60 | yes | yes | 13.00 | 8.00 | 128.00 | 7250.00 | 460.00 | 2020 | 139 | 5.18 | 6.31 | Premium | 6.13 |
| 40 | Huawei | Android | 25.60 | yes | no | 13.00 | 8.00 | 128.00 | 7250.00 | 460.00 | 2019 | 256 | 5.00 | 6.11 | Premium | 6.13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3446 | Apple | iOS | 12.90 | yes | no | 8.00 | 7.00 | 64.00 | 3046.00 | 188.00 | 2019 | 351 | 5.44 | 6.71 | Premium | 5.24 |
| 3447 | Apple | iOS | 15.27 | yes | no | 8.00 | 7.00 | 64.00 | 3110.00 | 194.00 | 2019 | 208 | 5.10 | 6.29 | Premium | 5.27 |
| 3448 | Asus | Android | 16.74 | yes | no | 13.00 | 24.00 | 128.00 | 6000.00 | 240.00 | 2019 | 325 | 5.72 | 7.06 | Premium | 5.48 |
| 3449 | Asus | Android | 15.34 | yes | no | 13.00 | 8.00 | 64.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 | Premium | 5.25 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 | Premium | 5.30 |
391 rows × 16 columns
These are newer phones, so the high prices do not appear to be errors; they are consistent with the release_year.
Outlier Check on normalized_used_price
histogram_boxplot(df2, "normalized_used_price")
df2[df2["normalized_used_price"]<3] # checking devices with normalized_used_price below 3
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range | weight_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | Micromax | Android | 7.75 | no | no | 0.30 | 0.30 | 0.50 | 1500.00 | 89.00 | 2014 | 956 | 2.39 | 3.66 | Budget | 4.49 |
| 105 | Micromax | Android | 10.03 | no | no | 2.00 | 0.30 | 0.20 | 2000.00 | 85.00 | 2013 | 680 | 2.86 | 4.24 | Budget | 4.44 |
| 107 | Micromax | Android | 7.75 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 89.00 | 2013 | 1029 | 2.99 | 3.88 | Budget | 4.49 |
| 113 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.10 | 1020.00 | 90.50 | 2020 | 272 | 2.91 | 3.40 | Budget | 4.51 |
| 114 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 1020.00 | 91.30 | 2020 | 288 | 2.73 | 2.91 | Budget | 4.51 |
| 117 | Nokia | Others | 5.18 | yes | no | 0.30 | 0.00 | 0.10 | 1200.00 | 86.50 | 2019 | 234 | 2.72 | 3.69 | Budget | 4.46 |
| 141 | Realme | Android | 15.27 | yes | no | 13.00 | 5.00 | 32.00 | 4000.00 | 166.00 | 2020 | 167 | 2.90 | 3.69 | Budget | 5.11 |
| 323 | Micromax | Android | 7.75 | no | no | 0.30 | 0.30 | 0.50 | 1500.00 | 89.00 | 2014 | 757 | 1.57 | 3.69 | Budget | 4.49 |
| 326 | Micromax | Android | 7.75 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 89.00 | 2013 | 897 | 2.85 | 3.87 | Budget | 4.49 |
| 332 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 1020.00 | 90.50 | 2020 | 322 | 2.52 | 3.39 | Budget | 4.51 |
| 333 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.10 | 1020.00 | 91.30 | 2020 | 148 | 2.56 | 3.01 | Budget | 4.51 |
| 336 | Nokia | Others | 5.18 | yes | no | 0.30 | 0.00 | 0.10 | 1200.00 | 86.50 | 2019 | 414 | 2.79 | 3.68 | Budget | 4.46 |
| 360 | Realme | Android | 15.27 | yes | no | 13.00 | 5.00 | 32.00 | 4000.00 | 166.00 | 2020 | 247 | 2.67 | 3.71 | Budget | 5.11 |
| 381 | Others | Android | 17.78 | no | no | 8.00 | 0.30 | 8.00 | 2000.00 | 242.00 | 2019 | 513 | 2.82 | 3.87 | Budget | 5.49 |
| 516 | Alcatel | Others | 7.75 | no | no | 2.00 | 0.30 | 512.00 | 2820.00 | 98.00 | 2015 | 786 | 2.86 | 3.68 | Budget | 4.58 |
| 520 | Alcatel | Others | 7.75 | no | no | 1.30 | 0.30 | 256.00 | 1000.00 | 93.00 | 2014 | 672 | 2.51 | 3.32 | Budget | 4.53 |
| 528 | Alcatel | Android | 7.75 | no | no | 2.00 | 0.30 | 16.00 | 1300.00 | 100.00 | 2014 | 921 | 2.78 | 4.39 | Budget | 4.61 |
| 533 | Alcatel | Others | 5.18 | no | no | 0.30 | 0.30 | 16.00 | 850.00 | 77.90 | 2014 | 809 | 1.62 | 3.37 | Budget | 4.36 |
| 534 | Alcatel | Others | 5.18 | no | no | 2.00 | 0.30 | 128.00 | 850.00 | 85.00 | 2014 | 656 | 2.34 | 3.01 | Budget | 4.44 |
| 545 | Alcatel | Others | 5.18 | no | no | 2.00 | 0.30 | 16.00 | 1000.00 | 89.00 | 2013 | 606 | 2.83 | 4.10 | Budget | 4.49 |
| 558 | Alcatel | Android | 7.75 | no | no | 2.00 | 0.30 | 512.00 | 1300.00 | 118.00 | 2013 | 858 | 2.84 | 3.93 | Budget | 4.77 |
| 618 | Others | Others | 5.18 | no | no | 0.08 | 2.00 | 16.00 | 1000.00 | 80.00 | 2014 | 1008 | 2.63 | 2.94 | Budget | 4.38 |
| 630 | Others | Others | 5.28 | no | no | 2.00 | 0.30 | 16.00 | 1280.00 | 113.60 | 2013 | 1046 | 2.92 | 3.05 | Budget | 4.73 |
| 636 | Others | Android | 17.78 | no | no | 2.00 | 0.30 | 32.00 | 2980.00 | 313.00 | 2015 | 821 | 2.89 | 3.93 | Budget | 5.75 |
| 884 | Others | Others | 5.08 | no | no | 1.30 | 0.30 | 128.00 | 820.00 | 80.00 | 2013 | 905 | 2.21 | 3.04 | Budget | 4.38 |
| 885 | Others | Others | 5.08 | no | no | 0.30 | 0.30 | 32.00 | 820.00 | 80.00 | 2013 | 636 | 1.54 | 3.04 | Budget | 4.38 |
| 886 | Others | Others | 5.28 | no | no | 1.30 | 0.30 | 32.00 | 820.00 | 90.00 | 2013 | 695 | 2.34 | 3.04 | Budget | 4.50 |
| 889 | Others | Android | 7.75 | no | no | 2.00 | 0.30 | 512.00 | 1000.00 | 105.00 | 2013 | 897 | 2.71 | 3.71 | Budget | 4.65 |
| 896 | Others | Others | 5.28 | no | no | 1.30 | 0.30 | 64.00 | 1400.00 | 95.00 | 2013 | 729 | 2.71 | 3.36 | Budget | 4.55 |
| 897 | Others | Others | 5.18 | no | no | 0.30 | 0.30 | 32.00 | 820.00 | 90.00 | 2013 | 883 | 2.46 | 3.39 | Budget | 4.50 |
| 898 | Others | Others | 5.08 | no | no | 1.30 | 0.30 | 64.00 | 620.00 | 69.00 | 2013 | 990 | 2.24 | 3.01 | Budget | 4.23 |
| 946 | Celkon | Android | 7.75 | no | no | 3.15 | 0.30 | 512.00 | 1400.00 | 102.00 | 2014 | 573 | 2.69 | 3.68 | Budget | 4.62 |
| 950 | Celkon | Android | 7.75 | no | no | 3.15 | 0.30 | 512.00 | 1400.00 | 140.00 | 2013 | 567 | 2.91 | 3.91 | Budget | 4.94 |
| 951 | Celkon | Others | 7.75 | no | no | 1.30 | 0.30 | 256.00 | 1200.00 | 140.00 | 2013 | 836 | 2.47 | 3.35 | Budget | 4.94 |
| 952 | Celkon | Others | 5.18 | no | no | 1.30 | 0.30 | 256.00 | 1800.00 | 140.00 | 2013 | 586 | 2.33 | 2.95 | Budget | 4.94 |
| 953 | Celkon | Others | 5.28 | no | no | 1.30 | 0.30 | 256.00 | 1400.00 | 140.00 | 2013 | 561 | 2.06 | 3.03 | Budget | 4.94 |
| 954 | Celkon | Others | 5.23 | no | no | 1.30 | 0.30 | 256.00 | 1400.00 | 140.00 | 2013 | 796 | 2.24 | 3.02 | Budget | 4.94 |
| 955 | Celkon | Others | 5.23 | no | no | 0.30 | 0.30 | 256.00 | 1400.00 | 140.00 | 2013 | 687 | 2.32 | 3.07 | Budget | 4.94 |
| 957 | Celkon | Android | 10.16 | no | no | 3.15 | 0.30 | 512.00 | 1400.00 | 140.00 | 2013 | 637 | 2.80 | 3.88 | Budget | 4.94 |
| 959 | Celkon | Android | 7.75 | no | no | 1.30 | 0.30 | 256.00 | 1400.00 | 140.00 | 2013 | 945 | 2.88 | 3.93 | Budget | 4.94 |
| 963 | Celkon | Others | 10.16 | no | no | 2.00 | 0.30 | 256.00 | 1200.00 | 140.00 | 2013 | 635 | 2.86 | 3.64 | Budget | 4.94 |
| 964 | Celkon | Others | 7.75 | no | no | 2.00 | 0.30 | 256.00 | 1200.00 | 140.00 | 2013 | 569 | 2.88 | 3.66 | Budget | 4.94 |
| 965 | Celkon | Others | 5.18 | no | no | 1.30 | 0.30 | 256.00 | 1800.00 | 140.00 | 2013 | 899 | 2.36 | 2.98 | Budget | 4.94 |
| 966 | Celkon | Others | 5.18 | no | no | 1.30 | 0.30 | 256.00 | 1500.00 | 140.00 | 2013 | 818 | 2.43 | 3.38 | Budget | 4.94 |
| 967 | Celkon | Others | 5.23 | no | no | 1.30 | 0.30 | 256.00 | 1400.00 | 140.00 | 2013 | 980 | 2.62 | 3.68 | Budget | 4.94 |
| 968 | Celkon | Others | 5.23 | no | no | 1.30 | 0.30 | 256.00 | 3000.00 | 140.00 | 2013 | 787 | 2.58 | 3.38 | Budget | 4.94 |
| 1465 | Lava | Android | 10.16 | no | no | 3.15 | 0.30 | 512.00 | 1500.00 | 133.00 | 2013 | 677 | 2.89 | 3.91 | Budget | 4.89 |
| 1466 | Lava | Android | 7.75 | no | no | 2.00 | 0.30 | 512.00 | 1400.00 | 115.40 | 2013 | 912 | 2.84 | 3.73 | Budget | 4.75 |
| 1467 | Lava | Android | 7.75 | no | no | 2.00 | 0.30 | 512.00 | 1400.00 | 115.40 | 2013 | 924 | 2.74 | 3.73 | Budget | 4.75 |
| 1468 | Lava | Android | 10.16 | no | no | 2.00 | 0.30 | 1024.00 | 1400.00 | 113.00 | 2013 | 595 | 2.98 | 3.93 | Budget | 4.73 |
| 1784 | LG | Others | 5.13 | no | no | 1.30 | 1.30 | 16.00 | 1700.00 | 92.00 | 2013 | 766 | 2.95 | 3.66 | Budget | 4.52 |
| 1898 | Micromax | Others | 5.28 | no | no | 0.30 | 2.00 | 16.00 | 3000.00 | 146.50 | 2014 | 664 | 2.35 | 2.99 | Budget | 4.99 |
| 1903 | Micromax | Others | 5.18 | no | no | 0.30 | 0.30 | 16.00 | 2800.00 | 260.00 | 2013 | 1037 | 2.38 | 2.95 | Budget | 5.56 |
| 1904 | Micromax | Others | 5.16 | no | no | 0.30 | 0.30 | 16.00 | 1800.00 | 118.00 | 2013 | 868 | 2.75 | 2.94 | Budget | 4.77 |
| 1906 | Micromax | Android | 7.75 | no | no | 1.30 | 2.00 | 512.00 | 1500.00 | 146.50 | 2014 | 876 | 2.97 | 3.90 | Budget | 4.99 |
| 1924 | Micromax | Others | 5.28 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 108.00 | 2013 | 603 | 2.72 | 3.00 | Budget | 4.68 |
| 1925 | Micromax | Others | 5.28 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 98.60 | 2013 | 748 | 2.35 | 3.02 | Budget | 4.59 |
| 1926 | Micromax | Others | 5.23 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 118.00 | 2013 | 596 | 2.71 | 2.96 | Budget | 4.77 |
| 1927 | Micromax | Others | 5.18 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 118.00 | 2013 | 694 | 2.69 | 3.05 | Budget | 4.77 |
| 1928 | Micromax | Others | 5.13 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 118.00 | 2013 | 1016 | 2.94 | 3.73 | Budget | 4.77 |
| 1929 | Micromax | Others | 5.28 | no | no | 0.30 | 0.30 | 16.00 | 2000.00 | 92.00 | 2013 | 946 | 2.27 | 3.41 | Budget | 4.52 |
| 2035 | Nokia | Others | 5.18 | no | no | 5.00 | 0.00 | 0.10 | 1020.00 | 157.00 | 2019 | 501 | 2.34 | 3.42 | Budget | 5.06 |
| 2058 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 1200.00 | 81.00 | 2016 | 815 | 2.72 | 3.37 | Budget | 4.39 |
| 2059 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 1200.00 | 82.60 | 2016 | 1023 | 2.77 | 3.66 | Budget | 4.41 |
| 2060 | Nokia | Others | 5.28 | no | no | 2.00 | 0.00 | 0.06 | 1200.00 | 91.80 | 2015 | 867 | 2.93 | 4.22 | Budget | 4.52 |
| 2064 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 1100.00 | 78.40 | 2015 | 559 | 2.59 | 3.38 | Budget | 4.36 |
| 2065 | Nokia | Others | 5.18 | no | no | 0.30 | 0.00 | 0.06 | 1100.00 | 78.60 | 2015 | 682 | 2.94 | 3.40 | Budget | 4.36 |
| 2288 | Others | Others | 5.18 | no | no | 0.30 | 2.00 | 32.00 | 2500.00 | 154.00 | 2015 | 706 | 2.93 | 3.01 | Budget | 5.04 |
| 2320 | Others | Others | 5.18 | no | no | 0.30 | 2.00 | 64.00 | 2100.00 | 150.00 | 2014 | 617 | 2.03 | 2.94 | Budget | 5.01 |
| 2324 | Others | Others | 5.18 | no | no | 0.30 | 2.00 | 32.00 | 2100.00 | 150.00 | 2014 | 789 | 2.32 | 2.90 | Budget | 5.01 |
| 2327 | Others | Others | 7.67 | no | no | 0.30 | 2.00 | 128.00 | 2100.00 | 150.00 | 2014 | 555 | 2.40 | 3.36 | Budget | 5.01 |
| 2516 | Samsung | Android | 7.70 | no | no | 2.00 | 2.00 | 16.00 | 1200.00 | 107.00 | 2014 | 1082 | 2.96 | 4.62 | Budget | 4.67 |
| 2533 | Samsung | Others | 5.08 | no | no | 8.00 | 2.00 | 16.00 | 800.00 | 75.00 | 2014 | 990 | 2.05 | 3.06 | Budget | 4.32 |
| 2641 | Samsung | Others | 7.62 | no | no | 2.00 | 1.30 | 16.00 | 1000.00 | 89.20 | 2013 | 542 | 2.86 | 4.07 | Budget | 4.49 |
| 2762 | Spice | Android | 7.75 | no | no | 1.30 | 0.30 | 512.00 | 2100.00 | 158.00 | 2013 | 794 | 2.66 | 3.73 | Budget | 5.06 |
| 2902 | Others | Android | 7.75 | no | no | 2.00 | 0.30 | 16.00 | 2000.00 | 118.00 | 2013 | 861 | 2.85 | 3.90 | Budget | 4.77 |
| 2911 | Others | Android | 7.75 | no | no | 2.00 | 1.30 | 16.00 | 2100.00 | 105.00 | 2014 | 955 | 2.87 | 3.89 | Budget | 4.65 |
| 3239 | Motorola | Android | 12.83 | yes | no | 13.00 | 5.00 | 16.00 | 3000.00 | 159.00 | 2019 | 379 | 2.83 | 3.93 | Budget | 5.07 |
df2[df2["normalized_used_price"] > 5] # fetching devices with a normalized (log-scale) used price above 5
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range | weight_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 4300.00 | 213.00 | 2020 | 325 | 5.16 | 5.52 | Mid-Range | 5.36 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 4200.00 | 213.00 | 2020 | 162 | 5.11 | 5.88 | Mid-Range | 5.36 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 7250.00 | 480.00 | 2020 | 345 | 5.14 | 5.63 | Mid-Range | 6.17 |
| 12 | Honor | Android | 16.69 | yes | yes | 13.00 | 16.00 | 128.00 | 4100.00 | 206.00 | 2019 | 537 | 5.38 | 6.21 | Premium | 5.33 |
| 16 | Honor | Android | 15.90 | yes | no | 13.00 | 32.00 | 128.00 | 3750.00 | 172.00 | 2019 | 395 | 5.02 | 5.52 | Mid-Range | 5.15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3445 | Apple | iOS | 15.37 | yes | no | 8.00 | 7.00 | 64.00 | 3969.00 | 226.00 | 2019 | 298 | 5.34 | 6.73 | Premium | 5.42 |
| 3446 | Apple | iOS | 12.90 | yes | no | 8.00 | 7.00 | 64.00 | 3046.00 | 188.00 | 2019 | 351 | 5.44 | 6.71 | Premium | 5.24 |
| 3447 | Apple | iOS | 15.27 | yes | no | 8.00 | 7.00 | 64.00 | 3110.00 | 194.00 | 2019 | 208 | 5.10 | 6.29 | Premium | 5.27 |
| 3448 | Asus | Android | 16.74 | yes | no | 13.00 | 24.00 | 128.00 | 6000.00 | 240.00 | 2019 | 325 | 5.72 | 7.06 | Premium | 5.48 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.00 | 8.00 | 128.00 | 4000.00 | 200.00 | 2018 | 541 | 5.04 | 6.25 | Premium | 5.30 |
416 rows × 16 columns
These outliers do not appear to be data errors; they are largely explained by brand and release year.
# Re-checking column types and non-null counts before plotting for outliers
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   category
 1   os                     3454 non-null   category
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   category
 4   5g                     3454 non-null   category
 5   main_camera_mp         3454 non-null   float64
 6   selfie_camera_mp       3454 non-null   float64
 7   int_memory             3454 non-null   float64
 8   battery                3454 non-null   float64
 9   weight                 3454 non-null   float64
 10  release_year           3454 non-null   int64
 11  days_used              3454 non-null   int64
 12  normalized_used_price  3454 non-null   float64
 13  normalized_new_price   3454 non-null   float64
 14  device_cost_range      3454 non-null   category
 15  weight_log             3454 non-null   float64
dtypes: category(5), float64(9), int64(2)
memory usage: 315.7 KB
numeric_columns = df2.select_dtypes(include=np.number).columns.to_list()
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numeric_columns):
plt.subplot(4,3, i + 1)
plt.boxplot(df2[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Update color properties
boxprops = dict(color="red") # Change the box color to red
capprops = dict(color="blue") # Change the cap color to blue
whiskerprops = dict(color="purple") # Change the whisker color to purple
flierprops = dict(markerfacecolor="teal") # Change the flier marker color to teal
medianprops = dict(color="violet") # Change the median line color to violet
# Apply the updated properties
fig, ax = plt.subplots()
ax.set_title("Numerical Column Boxplots")
plt.boxplot(
df2[variable], # `variable` still holds the last column from the loop above (weight_log)
whis=1.5,
boxprops=boxprops,
capprops=capprops,
whiskerprops=whiskerprops,
flierprops=flierprops,
medianprops=medianprops,
)
plt.show()
[Output: one boxplot per numeric column — screen_size, main_camera_mp, selfie_camera_mp, int_memory, battery, release_year, days_used, normalized_used_price, normalized_new_price, weight — followed by the styled "Numerical Column Boxplots" figure for weight_log]
# Checking the distribution of all numeric columns using histograms.
plt.figure(figsize=(15, 45))
for i in range(len(numeric_columns)):
plt.subplot(12, 3, i + 1)
plt.hist(df2[numeric_columns[i]], bins=50,color = "teal")
plt.tight_layout()
plt.title(numeric_columns[i], fontsize=25)
plt.show()
[Output: histograms (50 bins each) for screen_size, main_camera_mp, selfie_camera_mp, int_memory, battery, weight, release_year, days_used, normalized_used_price, normalized_new_price, and weight_log]
cols_list = df2.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
cols_list.remove("release_year")
# dropping weight_log as it duplicates the information already in weight
cols_list.remove("weight_log")
plt.figure(figsize=(15, 7))
sns.heatmap(
df2[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="mako"
)
plt.show()
[Output: correlation heatmap of the selected numeric columns]
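Beyond eyeballing the heatmap, it can help to rank each feature's absolute correlation with the target directly. A minimal sketch on a synthetic frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({"normalized_new_price": rng.normal(5, 1, 200)})
# used price tracks new price with some noise; battery is unrelated here
toy["normalized_used_price"] = 0.8 * toy["normalized_new_price"] + rng.normal(0, 0.3, 200)
toy["battery"] = rng.normal(3000, 500, 200)

# Rank absolute correlation with the target, strongest first
corr = (
    toy.corr(numeric_only=True)["normalized_used_price"]
    .drop("normalized_used_price")
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```

The same pattern applied to `df2[cols_list]` would surface which predictors correlate most with `normalized_used_price`.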
Most numeric columns contain many outliers; days_used is the only column without any.
Since these values look genuine rather than erroneous, removing them would discard valuable data, so they are retained.
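To make that trade-off concrete, the 1.5×IQR rule used by the boxplot whiskers can be turned into a per-column outlier count. A minimal sketch on a toy series (the 855 g value mimics the heaviest device in the weight column):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, whis: float = 1.5) -> int:
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((s < lower) | (s > upper)).sum())

weights = pd.Series([69, 85, 100, 110, 120, 130, 150, 160, 855])
print(iqr_outlier_count(weights))  # → 1
```

Applying this function to each column in `numeric_columns` would quantify how much data outlier removal would sacrifice.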
Data Preparation for Modeling
Our goal is to predict the normalized price of used devices.
We need to split the data into train and test sets.
We will build a linear regression model using the train set.
We'll then evaluate the model on the test set.
df3=df2.copy()
df3.columns
Index(['brand_name', 'os', 'screen_size', '4g', '5g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'battery', 'weight', 'release_year',
'days_used', 'normalized_used_price', 'normalized_new_price',
'device_cost_range', 'weight_log'],
dtype='object')
df3[df3.isna().any(axis=1)]
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | device_cost_range | weight_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 16 columns (no rows contain missing values)
df3.shape
(3454, 16)
# Define x and y variables
x = df3.drop(["normalized_used_price", "weight", "device_cost_range"], axis=1) # dropping "normalized_used_price", "device_cost_range", and "weight" from the data
y = df3["normalized_used_price"]
# independent variable
x.head()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | battery | release_year | days_used | normalized_new_price | weight_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.00 | 5.00 | 64.00 | 3020.00 | 2020 | 127 | 4.72 | 4.98 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.00 | 16.00 | 128.00 | 4300.00 | 2020 | 325 | 5.52 | 5.36 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.00 | 8.00 | 128.00 | 4200.00 | 2020 | 162 | 5.88 | 5.36 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.00 | 8.00 | 64.00 | 7250.00 | 2020 | 345 | 5.63 | 6.17 |
| 4 | Honor | Android | 15.32 | yes | no | 13.00 | 8.00 | 64.00 | 5000.00 | 2020 | 293 | 4.95 | 5.22 |
# dependent variable
y.head()
0   4.31
1   5.16
2   5.11
3   5.14
4   4.39
Name: normalized_used_price, dtype: float64
# creating dummy variables
x = pd.get_dummies(
x,
columns=x.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True, # Drop the first category to avoid multicollinearity
dtype=int,
) # Ensure the output is integer (0/1) instead of Boolean
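The effect of `drop_first` is easiest to see on a toy frame: with k categories, k − 1 dummy columns are produced and the dropped level becomes the baseline that the model's coefficients are measured against. A small sketch (hypothetical values, mirroring the os column):

```python
import pandas as pd

toy = pd.DataFrame({"os": ["Android", "iOS", "Others", "Android"]})

# Levels are sorted ("Android", "Others", "iOS" in ASCII order),
# and drop_first removes the first one, so Android becomes the baseline
dummies = pd.get_dummies(toy, columns=["os"], drop_first=True, dtype=int)
print(dummies)
```

This is why the dummy columns below include os_Others, os_Windows, and os_iOS but no os_Android.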
x.head()
| | screen_size | main_camera_mp | selfie_camera_mp | int_memory | battery | release_year | days_used | normalized_new_price | weight_log | brand_name_Alcatel | brand_name_Apple | brand_name_Asus | brand_name_BlackBerry | brand_name_Celkon | brand_name_Coolpad | brand_name_Gionee | brand_name_Google | brand_name_HTC | brand_name_Honor | brand_name_Huawei | brand_name_Infinix | brand_name_Karbonn | brand_name_LG | brand_name_Lava | brand_name_Lenovo | brand_name_Meizu | brand_name_Micromax | brand_name_Microsoft | brand_name_Motorola | brand_name_Nokia | brand_name_OnePlus | brand_name_Oppo | brand_name_Others | brand_name_Panasonic | brand_name_Realme | brand_name_Samsung | brand_name_Sony | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.50 | 13.00 | 5.00 | 64.00 | 3020.00 | 2020 | 127 | 4.72 | 4.98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 17.30 | 13.00 | 16.00 | 128.00 | 4300.00 | 2020 | 325 | 5.52 | 5.36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 16.69 | 13.00 | 8.00 | 128.00 | 4200.00 | 2020 | 162 | 5.88 | 5.36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 25.50 | 13.00 | 8.00 | 64.00 | 7250.00 | 2020 | 345 | 5.63 | 6.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 15.32 | 13.00 | 8.00 | 64.00 | 5000.00 | 2020 | 293 | 4.95 | 5.22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
# splitting the data into 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.30, random_state=1
)
# checking the shape of the train and test data
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037
# adding constant to the train data
x_train1 = sm.add_constant(x_train)
# adding constant to the test data
x_test1 = sm.add_constant(x_test)
linearregression = LinearRegression() # LinearRegression has been imported above
linearregression.fit(x_train1, y_train) # fit the dependent and independent train data
LinearRegression()
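Before reading coefficients, it is worth checking fit quality on held-out data with R² and RMSE. A self-contained sketch on synthetic data (the feature matrix and coefficients are made up; the workflow mirrors the split and fit above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# target is a known linear combination plus small noise
y = X @ np.array([0.4, 0.3, 0.1]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
model = LinearRegression().fit(X_tr, y_tr)

pred = model.predict(X_te)
r2 = r2_score(y_te, pred)
rmse = mean_squared_error(y_te, pred) ** 0.5
print("Test R^2 :", round(r2, 3))
print("Test RMSE:", round(rmse, 3))
```

Replacing the synthetic arrays with `x_train1`/`x_test1` and the price columns gives the same diagnostics for the model fitted here.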
# DataFrame showing the model coefficients and intercept
coef_df = pd.DataFrame(
np.append(linearregression.coef_, linearregression.intercept_),
index=x_train1.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
| | Coefficients |
|---|---|
| const | 0.00 |
| screen_size | 0.02 |
| main_camera_mp | 0.02 |
| selfie_camera_mp | 0.01 |
| int_memory | 0.00 |
| battery | -0.00 |
| release_year | 0.02 |
| days_used | 0.00 |
| normalized_new_price | 0.44 |
| weight_log | 0.32 |
| brand_name_Alcatel | 0.01 |
| brand_name_Apple | 0.09 |
| brand_name_Asus | 0.01 |
| brand_name_BlackBerry | -0.07 |
| brand_name_Celkon | -0.09 |
| brand_name_Coolpad | 0.02 |
| brand_name_Gionee | 0.04 |
| brand_name_Google | -0.02 |
| brand_name_HTC | -0.02 |
| brand_name_Honor | 0.03 |
| brand_name_Huawei | -0.00 |
| brand_name_Infinix | -1.95 |
| brand_name_Karbonn | 0.10 |
| brand_name_LG | -0.01 |
| brand_name_Lava | 0.03 |
| brand_name_Lenovo | 0.04 |
| brand_name_Meizu | -0.01 |
| brand_name_Micromax | -0.03 |
| brand_name_Microsoft | 0.08 |
| brand_name_Motorola | -0.02 |
| brand_name_Nokia | 0.06 |
| brand_name_OnePlus | 0.09 |
| brand_name_Oppo | 0.01 |
| brand_name_Others | -0.02 |
| brand_name_Panasonic | 0.06 |
| brand_name_Realme | 0.01 |
| brand_name_Samsung | -0.03 |
| brand_name_Sony | -0.07 |
| brand_name_Spice | -0.02 |
| brand_name_Vivo | -0.02 |
| brand_name_XOLO | 0.03 |
| brand_name_Xiaomi | 0.08 |
| brand_name_ZTE | -0.01 |
| os_Others | -0.05 |
| os_Windows | -0.00 |
| os_iOS | -0.16 |
| 4g_yes | 0.05 |
| 5g_yes | -0.01 |
| Intercept | -42.93 |
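Because the target is on a log scale, a coefficient b on an un-logged predictor translates to an approximate (e^b − 1) × 100% change in the actual used price per unit increase in that predictor. A quick check using the main_camera_mp coefficient from the model (0.0208 per the statsmodels summary below; treat the exact figure as model-specific):

```python
import numpy as np

b = 0.0208  # coefficient of main_camera_mp on the log-scale target
pct_change = (np.exp(b) - 1) * 100
print(f"Each extra main-camera megapixel ~ {pct_change:.2f}% higher used price, all else equal")
```

For normalized_new_price, which is itself log-scaled, the coefficient (~0.44) instead reads as an elasticity: a 1% higher new price is associated with roughly a 0.44% higher used price.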
Linear Regression using statsmodels
x_train1 = x_train1.astype(float) # Convert all columns to float
x_train1.dtypes
[Output: dtypes listing — all 48 columns (const, the numeric features, and the dummy variables through 5g_yes) are now float64]
olsmodel1st = sm.OLS(y_train, x_train1)
olsres = olsmodel1st.fit()
print(olsres.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.846
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 276.6
Date: Sat, 11 May 2024 Prob (F-statistic): 0.00
Time: 00:51:36 Log-Likelihood: 131.29
No. Observations: 2417 AIC: -166.6
Df Residuals: 2369 BIC: 111.4
Df Model: 47
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -42.9269 9.074 -4.731 0.000 -60.721 -25.133
screen_size 0.0206 0.003 6.328 0.000 0.014 0.027
main_camera_mp 0.0208 0.001 13.892 0.000 0.018 0.024
selfie_camera_mp 0.0146 0.001 13.329 0.000 0.012 0.017
int_memory 9.452e-05 6.92e-05 1.367 0.172 -4.11e-05 0.000
battery -2.097e-05 7.23e-06 -2.900 0.004 -3.52e-05 -6.79e-06
release_year 0.0212 0.004 4.730 0.000 0.012 0.030
days_used 4.287e-05 3.07e-05 1.396 0.163 -1.73e-05 0.000
normalized_new_price 0.4367 0.012 36.629 0.000 0.413 0.460
weight_log 0.3212 0.033 9.620 0.000 0.256 0.387
brand_name_Alcatel 0.0084 0.047 0.178 0.859 -0.085 0.101
brand_name_Apple 0.0851 0.147 0.581 0.561 -0.202 0.372
brand_name_Asus 0.0063 0.048 0.132 0.895 -0.087 0.100
brand_name_BlackBerry -0.0669 0.070 -0.955 0.340 -0.204 0.070
brand_name_Celkon -0.0881 0.066 -1.340 0.180 -0.217 0.041
brand_name_Coolpad 0.0184 0.073 0.253 0.800 -0.124 0.161
brand_name_Gionee 0.0388 0.058 0.675 0.500 -0.074 0.152
brand_name_Google -0.0238 0.084 -0.283 0.777 -0.189 0.142
brand_name_HTC -0.0155 0.048 -0.324 0.746 -0.110 0.079
brand_name_Honor 0.0303 0.049 0.618 0.536 -0.066 0.126
brand_name_Huawei -0.0022 0.044 -0.051 0.960 -0.089 0.085
brand_name_Infinix -1.9465 0.175 -11.143 0.000 -2.289 -1.604
brand_name_Karbonn 0.1015 0.067 1.518 0.129 -0.030 0.233
brand_name_LG -0.0149 0.045 -0.330 0.742 -0.103 0.074
brand_name_Lava 0.0290 0.062 0.467 0.640 -0.093 0.151
brand_name_Lenovo 0.0392 0.045 0.870 0.385 -0.049 0.128
brand_name_Meizu -0.0088 0.056 -0.158 0.874 -0.118 0.101
brand_name_Micromax -0.0306 0.048 -0.642 0.521 -0.124 0.063
brand_name_Microsoft 0.0808 0.088 0.919 0.358 -0.092 0.253
brand_name_Motorola -0.0197 0.049 -0.399 0.690 -0.117 0.077
brand_name_Nokia 0.0561 0.051 1.096 0.273 -0.044 0.157
brand_name_OnePlus 0.0858 0.077 1.112 0.266 -0.065 0.237
brand_name_Oppo 0.0115 0.048 0.241 0.809 -0.082 0.105
brand_name_Others -0.0178 0.042 -0.424 0.672 -0.100 0.064
brand_name_Panasonic 0.0623 0.056 1.119 0.263 -0.047 0.171
brand_name_Realme 0.0120 0.061 0.195 0.845 -0.108 0.132
brand_name_Samsung -0.0279 0.043 -0.649 0.517 -0.112 0.057
brand_name_Sony -0.0666 0.050 -1.325 0.185 -0.165 0.032
brand_name_Spice -0.0194 0.063 -0.308 0.758 -0.143 0.104
brand_name_Vivo -0.0183 0.048 -0.380 0.704 -0.113 0.076
brand_name_XOLO 0.0328 0.055 0.599 0.549 -0.074 0.140
brand_name_Xiaomi 0.0813 0.048 1.694 0.090 -0.013 0.175
brand_name_ZTE -0.0068 0.047 -0.144 0.886 -0.099 0.086
os_Others -0.0532 0.031 -1.699 0.090 -0.115 0.008
os_Windows -0.0010 0.045 -0.022 0.983 -0.089 0.087
os_iOS -0.1599 0.146 -1.098 0.273 -0.446 0.126
4g_yes 0.0500 0.016 3.172 0.002 0.019 0.081
5g_yes -0.0094 0.028 -0.338 0.736 -0.064 0.045
==============================================================================
Omnibus: 241.482 Durbin-Watson: 1.916
Prob(Omnibus): 0.000 Jarque-Bera (JB): 462.194
Skew: -0.656 Prob(JB): 4.32e-101
Kurtosis: 4.693 Cond. No. 7.61e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
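As a sanity check, the formulas implemented by these helpers can be exercised on a small made-up example with plain NumPy (the numbers below are illustrative only):

```python
import numpy as np

# made-up actuals and predictions: 5 observations, assume k = 2 predictors
y_true = np.array([100.0, 120.0, 150.0, 130.0, 160.0])
y_pred = np.array([110.0, 118.0, 145.0, 128.0, 158.0])
n, k = 5, 2

# R-squared: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# adjusted R-squared penalizes the number of predictors k
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# MAPE: mean absolute percentage error
mape = np.mean(np.abs(y_true - y_pred) / y_true) * 100

print(round(r2, 3), round(adj_r2, 3), round(mape, 2))
```

These are the same formulas that `adj_r2_score` and `mape_score` apply to the model's predictions.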
# compute multiple metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# Checking model performance on the train set (70% data)
print("Performance Training\n")
olsmodel1st_train_perf = model_performance_regression(olsres, x_train1, y_train)
print(olsmodel1st_train_perf)
Performance Training
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.85 | 0.84 | 4.30 |
# Checking model performance on the test set (30% data)
print("Performance Test\n")
olsmodel1st_test_perf = model_performance_regression(olsres, x_test1, y_test)
print(olsmodel1st_test_perf)
Performance Test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.18 | 0.84 | 0.84 | 4.49 |
Interpreting the Regression Results:
Adjusted R-squared: It reflects the fit of the model, adjusted for the number of predictors.
Adjusted R-squared values generally range from 0 to 1, where a higher value indicates a better fit, assuming certain conditions are met.
In our case, the adjusted R-squared is 0.843, which is good.
const coefficient: It is the Y-intercept.
It means that if all the predictor variables are zero, then the expected output (i.e., Y) would be equal to the const coefficient.
In our case, the value of the const coefficient is -42.9269.
Coefficient of a predictor variable: It represents the change in the output Y for a one-unit change in that predictor, with everything else held constant.
In our case, several features have negative coefficients, meaning increases in those features slightly reduce the predicted used-device price.
Positive coefficients mean increases in those features increase the predicted used-device price.
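As a quick worked example of reading a coefficient (using the fitted value for main_camera_mp from the summary above; note the target is the normalized used price, so the effect is on that normalized scale, not in currency units):

```python
# coefficient of main_camera_mp from the OLS summary above
coef_main_camera = 0.0208

# holding everything else constant, a device with 4 more megapixels of
# main camera is predicted to have a normalized used price higher by:
delta_mp = 4
change_in_normalized_price = coef_main_camera * delta_mp
print(change_in_normalized_price)
```

If the normalization involves a natural log (the column names suggest this but do not guarantee it), a change of about 0.083 in log price would correspond to roughly an 8.7% higher price, since e^0.0832 ≈ 1.087.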
Linear Regression Assumptions:
Multicollinearity: This occurs when there is a correlation between multiple independent variables in a multiple regression model. Essentially, it means that some of the predictor variables are related to each other, which can adversely affect the regression results.
Variance Inflation Factor (VIF): The VIF quantifies how much the variance of a regression coefficient estimate is increased due to collinearity. Specifically, it compares the variance of a parameter estimate when fitting a full model (including other parameters) to the variance of the parameter estimate when fitting the model with only that parameter.
General Rule of Thumb: a VIF below 5 indicates low to moderate multicollinearity, a VIF between 5 and 10 indicates high multicollinearity, and a VIF above 10 indicates severe multicollinearity that should be treated.
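The VIF described above can be computed directly from an auxiliary regression: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A minimal NumPy-only sketch on synthetic (made-up) data, just for intuition:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic predictors: x2 is deliberately almost a copy of x1
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])  # column 0 is the intercept

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# x1 and x2 should show very large VIFs; x3 should be close to 1
print(vif(X, 1), vif(X, 2), vif(X, 3))
```

This mirrors what statsmodels' `variance_inflation_factor` computes; the helper here is only for illustration.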
from statsmodels.stats.outliers_influence import variance_inflation_factor
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train1) # checking VIF on train1
| feature | VIF | |
|---|---|---|
| 0 | const | 3713721.46 |
| 1 | screen_size | 7.09 |
| 2 | main_camera_mp | 5.50 |
| 3 | selfie_camera_mp | 2.68 |
| 4 | int_memory | 1.35 |
| 5 | battery | 4.06 |
| 6 | release_year | 4.78 |
| 7 | days_used | 2.65 |
| 8 | normalized_new_price | 2.97 |
| 9 | weight_log | 6.12 |
| 10 | brand_name_Alcatel | 3.40 |
| 11 | brand_name_Apple | 13.04 |
| 12 | brand_name_Asus | 3.33 |
| 13 | brand_name_BlackBerry | 1.63 |
| 14 | brand_name_Celkon | 1.76 |
| 15 | brand_name_Coolpad | 1.47 |
| 16 | brand_name_Gionee | 1.95 |
| 17 | brand_name_Google | 1.32 |
| 18 | brand_name_HTC | 3.41 |
| 19 | brand_name_Honor | 3.34 |
| 20 | brand_name_Huawei | 5.98 |
| 21 | brand_name_Infinix | 4.54 |
| 22 | brand_name_Karbonn | 1.57 |
| 23 | brand_name_LG | 4.84 |
| 24 | brand_name_Lava | 1.71 |
| 25 | brand_name_Lenovo | 4.56 |
| 26 | brand_name_Meizu | 2.18 |
| 27 | brand_name_Micromax | 3.36 |
| 28 | brand_name_Microsoft | 1.87 |
| 29 | brand_name_Motorola | 3.27 |
| 30 | brand_name_Nokia | 3.42 |
| 31 | brand_name_OnePlus | 1.44 |
| 32 | brand_name_Oppo | 3.97 |
| 33 | brand_name_Others | 9.71 |
| 34 | brand_name_Panasonic | 2.11 |
| 35 | brand_name_Realme | 1.94 |
| 36 | brand_name_Samsung | 7.54 |
| 37 | brand_name_Sony | 2.94 |
| 38 | brand_name_Spice | 1.69 |
| 39 | brand_name_Vivo | 3.65 |
| 40 | brand_name_XOLO | 2.14 |
| 41 | brand_name_Xiaomi | 3.72 |
| 42 | brand_name_ZTE | 3.80 |
| 43 | os_Others | 1.71 |
| 44 | os_Windows | 1.59 |
| 45 | os_iOS | 11.73 |
| 46 | 4g_yes | 2.45 |
| 47 | 5g_yes | 1.42 |
There are several columns with high VIF, so there is definitely multicollinearity.
The high-VIF columns will need to be treated or dropped.
Columns with VIF higher than 10: brand_name_Apple and os_iOS.
Columns with VIF between 5 and 10: screen_size, main_camera_mp, weight_log, brand_name_Huawei, brand_name_Others, and brand_name_Samsung.
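The same bucketing can be done programmatically. A small sketch in plain Python, using a handful of values copied from the VIF table above (the helper here is hypothetical, not project code):

```python
# a few (feature, VIF) pairs copied from the VIF table above
vifs = {
    "brand_name_Apple": 13.04,
    "os_iOS": 11.73,
    "screen_size": 7.09,
    "weight_log": 6.12,
    "selfie_camera_mp": 2.68,
}

# bucket features by the rule-of-thumb thresholds
severe = [f for f, v in vifs.items() if v > 10]     # address these first
high = [f for f, v in vifs.items() if 5 < v <= 10]  # address these next
print(severe)  # ['brand_name_Apple', 'os_iOS']
print(high)    # ['screen_size', 'weight_log']
```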
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
pd.set_option("display.float_format", lambda x: "{:.6f}".format(x))
col_list = [
"screen_size",
"main_camera_mp",
"weight_log",
"brand_name_Apple",
"brand_name_Huwawei",
"brand_name_Others",
"brand_name_Samsung",
"os_ios"
]
res = treating_multicollinearity(x_train1, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Others | 0.842833 | 0.231448 |
| 1 | brand_name_Apple | 0.842823 | 0.231455 |
| 2 | brand_name_Samsung | 0.842817 | 0.231460 |
| 3 | brand_name_Huwawei | 0.842779 | 0.231488 |
| 4 | os_ios | 0.842779 | 0.231488 |
| 5 | screen_size | 0.840189 | 0.233387 |
| 6 | weight_log | 0.836707 | 0.235916 |
| 7 | main_camera_mp | 0.830043 | 0.240681 |
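One caveat about the prefix matching used in `treating_multicollinearity`: `str.startswith` is case-sensitive, so a misspelled name in `high_vif_columns` (e.g. `brand_name_Huwawei` and `os_ios` in the list above, which do not match the actual `brand_name_Huawei` / `os_iOS` columns) silently drops nothing and refits the unchanged model, which is consistent with those two rows reporting identical metrics. A minimal illustration with plain Python:

```python
# str.startswith is case-sensitive: a misspelled prefix matches nothing
columns = ["brand_name_Huawei", "os_iOS", "screen_size"]

def drop_by_prefix(cols, prefix):
    return [c for c in cols if not c.startswith(prefix)]

print(drop_by_prefix(columns, "os_ios"))  # no match: nothing is dropped
print(drop_by_prefix(columns, "os_iOS"))  # exact case: the column is dropped
```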
# dropping "brand_name_Others" and check of VIF again.
col_to_drop = "brand_name_Others"
x_train2 = x_train1.loc[:, ~x_train1.columns.str.startswith(col_to_drop)]
x_test2 = x_test1.loc[:, ~x_test1.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping brand_name_Others
| feature | VIF | |
|---|---|---|
| 0 | const | 3713060.846677 |
| 1 | screen_size | 7.042135 |
| 2 | main_camera_mp | 5.496114 |
| 3 | selfie_camera_mp | 2.679081 |
| 4 | int_memory | 1.349697 |
| 5 | battery | 4.063285 |
| 6 | release_year | 4.778269 |
| 7 | days_used | 2.649911 |
| 8 | normalized_new_price | 2.967449 |
| 9 | weight_log | 6.100547 |
| 10 | brand_name_Alcatel | 1.218717 |
| 11 | brand_name_Apple | 12.187532 |
| 12 | brand_name_Asus | 1.203185 |
| 13 | brand_name_BlackBerry | 1.128476 |
| 14 | brand_name_Celkon | 1.160724 |
| 15 | brand_name_Coolpad | 1.054235 |
| 16 | brand_name_Gionee | 1.090693 |
| 17 | brand_name_Google | 1.046893 |
| 18 | brand_name_HTC | 1.224358 |
| 19 | brand_name_Honor | 1.284828 |
| 20 | brand_name_Huawei | 1.509942 |
| 21 | brand_name_Infinix | 4.266932 |
| 22 | brand_name_Karbonn | 1.076962 |
| 23 | brand_name_LG | 1.353189 |
| 24 | brand_name_Lava | 1.071322 |
| 25 | brand_name_Lenovo | 1.301168 |
| 26 | brand_name_Meizu | 1.136850 |
| 27 | brand_name_Micromax | 1.227856 |
| 28 | brand_name_Microsoft | 1.494295 |
| 29 | brand_name_Motorola | 1.258473 |
| 30 | brand_name_Nokia | 1.464409 |
| 31 | brand_name_OnePlus | 1.078806 |
| 32 | brand_name_Oppo | 1.376836 |
| 33 | brand_name_Panasonic | 1.107531 |
| 34 | brand_name_Realme | 1.165281 |
| 35 | brand_name_Samsung | 1.600712 |
| 36 | brand_name_Sony | 1.214444 |
| 37 | brand_name_Spice | 1.080180 |
| 38 | brand_name_Vivo | 1.328857 |
| 39 | brand_name_XOLO | 1.125674 |
| 40 | brand_name_Xiaomi | 1.313988 |
| 41 | brand_name_ZTE | 1.266371 |
| 42 | os_Others | 1.705367 |
| 43 | os_Windows | 1.587232 |
| 44 | os_iOS | 11.730873 |
| 45 | 4g_yes | 2.444886 |
| 46 | 5g_yes | 1.423004 |
Removing brand_name_Others lowered the VIF for brand_name_Apple, brand_name_Huawei, and brand_name_Samsung.
# Narrowing the candidate list to the columns that still show elevated VIF
col_list = [
"screen_size",
"main_camera_mp",
"weight_log",
"brand_name_Apple",
"os_ios"
]
res = treating_multicollinearity(x_train2, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Apple | 0.842866 | 0.231424 |
| 1 | os_ios | 0.842833 | 0.231448 |
| 2 | screen_size | 0.840200 | 0.233378 |
| 3 | weight_log | 0.836774 | 0.235867 |
| 4 | main_camera_mp | 0.830115 | 0.240630 |
There was minimal change in the adjusted R-squared and RMSE.
# dropping "oos_iOS" and check of VIF again.
col_to_drop = "os_iOS"
x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train3)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping os_iOS
| feature | VIF | |
|---|---|---|
| 0 | const | 3711987.203394 |
| 1 | screen_size | 6.961123 |
| 2 | main_camera_mp | 5.491045 |
| 3 | selfie_camera_mp | 2.673669 |
| 4 | int_memory | 1.349599 |
| 5 | battery | 4.057882 |
| 6 | release_year | 4.776775 |
| 7 | days_used | 2.649412 |
| 8 | normalized_new_price | 2.965480 |
| 9 | weight_log | 6.068362 |
| 10 | brand_name_Alcatel | 1.218643 |
| 11 | brand_name_Apple | 1.215072 |
| 12 | brand_name_Asus | 1.203077 |
| 13 | brand_name_BlackBerry | 1.125644 |
| 14 | brand_name_Celkon | 1.160591 |
| 15 | brand_name_Coolpad | 1.054230 |
| 16 | brand_name_Gionee | 1.090658 |
| 17 | brand_name_Google | 1.046759 |
| 18 | brand_name_HTC | 1.224278 |
| 19 | brand_name_Honor | 1.284312 |
| 20 | brand_name_Huawei | 1.509344 |
| 21 | brand_name_Infinix | 4.262662 |
| 22 | brand_name_Karbonn | 1.076832 |
| 23 | brand_name_LG | 1.353189 |
| 24 | brand_name_Lava | 1.071212 |
| 25 | brand_name_Lenovo | 1.301095 |
| 26 | brand_name_Meizu | 1.136828 |
| 27 | brand_name_Micromax | 1.227826 |
| 28 | brand_name_Microsoft | 1.493899 |
| 29 | brand_name_Motorola | 1.258416 |
| 30 | brand_name_Nokia | 1.456619 |
| 31 | brand_name_OnePlus | 1.078729 |
| 32 | brand_name_Oppo | 1.376836 |
| 33 | brand_name_Panasonic | 1.107508 |
| 34 | brand_name_Realme | 1.164948 |
| 35 | brand_name_Samsung | 1.599714 |
| 36 | brand_name_Sony | 1.214436 |
| 37 | brand_name_Spice | 1.079532 |
| 38 | brand_name_Vivo | 1.328477 |
| 39 | brand_name_XOLO | 1.125607 |
| 40 | brand_name_Xiaomi | 1.313743 |
| 41 | brand_name_ZTE | 1.266319 |
| 42 | os_Others | 1.609764 |
| 43 | os_Windows | 1.584997 |
| 44 | 4g_yes | 2.443500 |
| 45 | 5g_yes | 1.422500 |
Dropping os_iOS brought the VIF for brand_name_Apple down to about 1.22.
# Narrowing the candidate list to the columns that still show elevated VIF
col_list = [
"screen_size",
"main_camera_mp",
"weight_log",
]
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | screen_size | 0.840055 | 0.233484 |
| 1 | weight_log | 0.836833 | 0.235825 |
| 2 | main_camera_mp | 0.830034 | 0.240688 |
Not seeing any big changes from the removal of os_iOS.
# dropping "screen_size" and check of VIF again.
col_to_drop = "screen_size"
x_train4 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test4 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train4)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping screen_size
| feature | VIF | |
|---|---|---|
| 0 | const | 3642036.237228 |
| 1 | main_camera_mp | 5.490024 |
| 2 | selfie_camera_mp | 2.673225 |
| 3 | int_memory | 1.347985 |
| 4 | battery | 3.729073 |
| 5 | release_year | 4.698296 |
| 6 | days_used | 2.644402 |
| 7 | normalized_new_price | 2.942854 |
| 8 | weight_log | 3.105616 |
| 9 | brand_name_Alcatel | 1.202472 |
| 10 | brand_name_Apple | 1.213353 |
| 11 | brand_name_Asus | 1.202095 |
| 12 | brand_name_BlackBerry | 1.125372 |
| 13 | brand_name_Celkon | 1.159376 |
| 14 | brand_name_Coolpad | 1.052585 |
| 15 | brand_name_Gionee | 1.089449 |
| 16 | brand_name_Google | 1.046755 |
| 17 | brand_name_HTC | 1.223400 |
| 18 | brand_name_Honor | 1.267997 |
| 19 | brand_name_Huawei | 1.496721 |
| 20 | brand_name_Infinix | 4.262179 |
| 21 | brand_name_Karbonn | 1.068689 |
| 22 | brand_name_LG | 1.351833 |
| 23 | brand_name_Lava | 1.068629 |
| 24 | brand_name_Lenovo | 1.296695 |
| 25 | brand_name_Meizu | 1.135736 |
| 26 | brand_name_Micromax | 1.225362 |
| 27 | brand_name_Microsoft | 1.493735 |
| 28 | brand_name_Motorola | 1.258410 |
| 29 | brand_name_Nokia | 1.455713 |
| 30 | brand_name_OnePlus | 1.077064 |
| 31 | brand_name_Oppo | 1.373142 |
| 32 | brand_name_Panasonic | 1.104337 |
| 33 | brand_name_Realme | 1.164948 |
| 34 | brand_name_Samsung | 1.591812 |
| 35 | brand_name_Sony | 1.213313 |
| 36 | brand_name_Spice | 1.079277 |
| 37 | brand_name_Vivo | 1.320419 |
| 38 | brand_name_XOLO | 1.118077 |
| 39 | brand_name_Xiaomi | 1.312963 |
| 40 | brand_name_ZTE | 1.264329 |
| 41 | os_Others | 1.501386 |
| 42 | os_Windows | 1.584486 |
| 43 | 4g_yes | 2.437756 |
| 44 | 5g_yes | 1.417622 |
weight_log's VIF dropped to about 3.11, while main_camera_mp's did not change.
# Narrowing the candidate list to the columns that still show elevated VIF
col_list = [
"main_camera_mp",
"weight_log",
]
res = treating_multicollinearity(x_train4, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | main_camera_mp | 0.827108 | 0.242750 |
| 1 | weight_log | 0.814299 | 0.251582 |
Not removing screen_size, because dropping it lowers the adjusted R-squared and increases the RMSE.
# dropping "weight_log" and check of VIF again.
col_to_drop = "weight_log"
x_train5 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test5 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train5)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping weight_log
| feature | VIF | |
|---|---|---|
| 0 | const | 3658621.476742 |
| 1 | screen_size | 3.562506 |
| 2 | main_camera_mp | 5.296387 |
| 3 | selfie_camera_mp | 2.657013 |
| 4 | int_memory | 1.349077 |
| 5 | battery | 3.684865 |
| 6 | release_year | 4.723441 |
| 7 | days_used | 2.645465 |
| 8 | normalized_new_price | 2.944632 |
| 9 | brand_name_Alcatel | 1.210645 |
| 10 | brand_name_Apple | 1.215033 |
| 11 | brand_name_Asus | 1.202666 |
| 12 | brand_name_BlackBerry | 1.125436 |
| 13 | brand_name_Celkon | 1.160589 |
| 14 | brand_name_Coolpad | 1.052807 |
| 15 | brand_name_Gionee | 1.087644 |
| 16 | brand_name_Google | 1.045393 |
| 17 | brand_name_HTC | 1.220174 |
| 18 | brand_name_Honor | 1.269667 |
| 19 | brand_name_Huawei | 1.496749 |
| 20 | brand_name_Infinix | 4.177190 |
| 21 | brand_name_Karbonn | 1.071017 |
| 22 | brand_name_LG | 1.340184 |
| 23 | brand_name_Lava | 1.066652 |
| 24 | brand_name_Lenovo | 1.297118 |
| 25 | brand_name_Meizu | 1.133968 |
| 26 | brand_name_Micromax | 1.220541 |
| 27 | brand_name_Microsoft | 1.493859 |
| 28 | brand_name_Motorola | 1.258245 |
| 29 | brand_name_Nokia | 1.451806 |
| 30 | brand_name_OnePlus | 1.075772 |
| 31 | brand_name_Oppo | 1.365637 |
| 32 | brand_name_Panasonic | 1.102077 |
| 33 | brand_name_Realme | 1.164280 |
| 34 | brand_name_Samsung | 1.581888 |
| 35 | brand_name_Sony | 1.214038 |
| 36 | brand_name_Spice | 1.079528 |
| 37 | brand_name_Vivo | 1.314890 |
| 38 | brand_name_XOLO | 1.108586 |
| 39 | brand_name_Xiaomi | 1.309304 |
| 40 | brand_name_ZTE | 1.261974 |
| 41 | os_Others | 1.588377 |
| 42 | os_Windows | 1.584635 |
| 43 | 4g_yes | 2.435027 |
| 44 | 5g_yes | 1.421179 |
# Narrowing the candidate list to the columns that still show elevated VIF
col_list = [
"screen_size",
"main_camera_mp",
]
res = treating_multicollinearity(x_train5, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | main_camera_mp | 0.826802 | 0.242965 |
| 1 | screen_size | 0.814299 | 0.251582 |
Not removing weight_log, because dropping it lowers the adjusted R-squared and increases the RMSE.
col_list = [
"screen_size",
"weight_log",
"main_camera_mp",
]
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | screen_size | 0.840055 | 0.233484 |
| 1 | weight_log | 0.836833 | 0.235825 |
| 2 | main_camera_mp | 0.830034 | 0.240688 |
Not removing any of these columns, since dropping them lowers the adjusted R-squared and increases the RMSE.
# dropping "main_camera_mp" and check of VIF again.
col_to_drop = "main_camera_mp"
x_train6 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test6 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train6)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping main_camera_mp
| feature | VIF | |
|---|---|---|
| 0 | const | 3678639.957204 |
| 1 | screen_size | 6.959828 |
| 2 | selfie_camera_mp | 2.651766 |
| 3 | int_memory | 1.328239 |
| 4 | battery | 4.021472 |
| 5 | release_year | 4.732022 |
| 6 | days_used | 2.627358 |
| 7 | normalized_new_price | 2.523537 |
| 8 | weight_log | 5.853238 |
| 9 | brand_name_Alcatel | 1.213304 |
| 10 | brand_name_Apple | 1.206641 |
| 11 | brand_name_Asus | 1.202294 |
| 12 | brand_name_BlackBerry | 1.125119 |
| 13 | brand_name_Celkon | 1.157179 |
| 14 | brand_name_Coolpad | 1.053651 |
| 15 | brand_name_Gionee | 1.090658 |
| 16 | brand_name_Google | 1.046274 |
| 17 | brand_name_HTC | 1.224182 |
| 18 | brand_name_Honor | 1.277908 |
| 19 | brand_name_Huawei | 1.508807 |
| 20 | brand_name_Infinix | 1.055217 |
| 21 | brand_name_Karbonn | 1.074747 |
| 22 | brand_name_LG | 1.337416 |
| 23 | brand_name_Lava | 1.071120 |
| 24 | brand_name_Lenovo | 1.300870 |
| 25 | brand_name_Meizu | 1.131783 |
| 26 | brand_name_Micromax | 1.227795 |
| 27 | brand_name_Microsoft | 1.493299 |
| 28 | brand_name_Motorola | 1.233141 |
| 29 | brand_name_Nokia | 1.447914 |
| 30 | brand_name_OnePlus | 1.078523 |
| 31 | brand_name_Oppo | 1.374457 |
| 32 | brand_name_Panasonic | 1.106172 |
| 33 | brand_name_Realme | 1.153660 |
| 34 | brand_name_Samsung | 1.587014 |
| 35 | brand_name_Sony | 1.186538 |
| 36 | brand_name_Spice | 1.079493 |
| 37 | brand_name_Vivo | 1.325080 |
| 38 | brand_name_XOLO | 1.124899 |
| 39 | brand_name_Xiaomi | 1.303664 |
| 40 | brand_name_ZTE | 1.257817 |
| 41 | os_Others | 1.601495 |
| 42 | os_Windows | 1.584996 |
| 43 | 4g_yes | 2.328379 |
| 44 | 5g_yes | 1.391042 |
weight_log's VIF goes down somewhat, while screen_size's stays about the same.
# Narrowing the candidate list to the columns that still show elevated VIF
col_list = [
"screen_size",
"weight_log",
"main_camera_mp"
]
res = treating_multicollinearity(x_train6, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | main_camera_mp | 0.830034 | 0.240688 |
| 1 | screen_size | 0.827108 | 0.242750 |
| 2 | weight_log | 0.826802 | 0.242965 |
# Re-running the same check on x_train3 for comparison with x_train6
col_list = [
"screen_size",
"weight_log",
"main_camera_mp"
]
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | screen_size | 0.840055 | 0.233484 |
| 1 | weight_log | 0.836833 | 0.235825 |
| 2 | main_camera_mp | 0.830034 | 0.240688 |
Will be using x_train3 as our final training set.
olsmode2nd = sm.OLS(y_train, x_train3)
olsres2 = olsmode2nd.fit()
print(olsres2.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.846
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 288.9
Date: Sat, 11 May 2024 Prob (F-statistic): 0.00
Time: 00:51:40 Log-Likelihood: 130.59
No. Observations: 2417 AIC: -169.2
Df Residuals: 2371 BIC: 97.18
Df Model: 45
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -43.0446 9.071 -4.746 0.000 -60.832 -25.257
screen_size 0.0211 0.003 6.536 0.000 0.015 0.027
main_camera_mp 0.0208 0.001 13.927 0.000 0.018 0.024
selfie_camera_mp 0.0145 0.001 13.297 0.000 0.012 0.017
int_memory 9.314e-05 6.91e-05 1.347 0.178 -4.24e-05 0.000
battery -2.068e-05 7.23e-06 -2.862 0.004 -3.49e-05 -6.51e-06
release_year 0.0213 0.004 4.744 0.000 0.012 0.030
days_used 4.255e-05 3.07e-05 1.386 0.166 -1.76e-05 0.000
normalized_new_price 0.4371 0.012 36.678 0.000 0.414 0.460
weight_log 0.3177 0.033 9.558 0.000 0.253 0.383
brand_name_Alcatel 0.0243 0.028 0.856 0.392 -0.031 0.080
brand_name_Apple -0.0462 0.045 -1.033 0.302 -0.134 0.042
brand_name_Asus 0.0222 0.029 0.774 0.439 -0.034 0.078
brand_name_BlackBerry -0.0535 0.058 -0.922 0.357 -0.167 0.060
brand_name_Celkon -0.0725 0.053 -1.358 0.175 -0.177 0.032
brand_name_Coolpad 0.0349 0.062 0.566 0.571 -0.086 0.156
brand_name_Gionee 0.0553 0.043 1.285 0.199 -0.029 0.140
brand_name_Google -0.0085 0.075 -0.113 0.910 -0.156 0.139
brand_name_HTC 0.0010 0.029 0.035 0.972 -0.055 0.057
brand_name_Honor 0.0459 0.030 1.511 0.131 -0.014 0.105
brand_name_Huawei 0.0135 0.022 0.607 0.544 -0.030 0.057
brand_name_Infinix -1.9342 0.169 -11.429 0.000 -2.266 -1.602
brand_name_Karbonn 0.1181 0.055 2.135 0.033 0.010 0.227
brand_name_LG 0.0013 0.024 0.056 0.955 -0.045 0.048
brand_name_Lava 0.0456 0.049 0.929 0.353 -0.051 0.142
brand_name_Lenovo 0.0551 0.024 2.290 0.022 0.008 0.102
brand_name_Meizu 0.0073 0.040 0.182 0.856 -0.072 0.086
brand_name_Micromax -0.0144 0.029 -0.499 0.618 -0.071 0.042
brand_name_Microsoft 0.0961 0.079 1.222 0.222 -0.058 0.250
brand_name_Motorola -0.0035 0.031 -0.115 0.908 -0.064 0.057
brand_name_Nokia 0.0698 0.033 2.089 0.037 0.004 0.135
brand_name_OnePlus 0.1014 0.067 1.517 0.129 -0.030 0.233
brand_name_Oppo 0.0277 0.028 0.990 0.322 -0.027 0.083
brand_name_Panasonic 0.0787 0.040 1.951 0.051 -0.000 0.158
brand_name_Realme 0.0275 0.047 0.580 0.562 -0.066 0.121
brand_name_Samsung -0.0123 0.020 -0.620 0.535 -0.051 0.027
brand_name_Sony -0.0504 0.032 -1.560 0.119 -0.114 0.013
brand_name_Spice -0.0020 0.050 -0.040 0.968 -0.101 0.097
brand_name_Vivo -0.0026 0.029 -0.088 0.930 -0.060 0.055
brand_name_XOLO 0.0490 0.040 1.237 0.216 -0.029 0.127
brand_name_Xiaomi 0.0972 0.029 3.409 0.001 0.041 0.153
brand_name_ZTE 0.0097 0.027 0.357 0.721 -0.044 0.063
os_Others -0.0451 0.030 -1.482 0.139 -0.105 0.015
os_Windows 0.0003 0.045 0.007 0.994 -0.088 0.088
4g_yes 0.0492 0.016 3.130 0.002 0.018 0.080
5g_yes -0.0099 0.028 -0.356 0.722 -0.064 0.045
==============================================================================
Omnibus: 242.033 Durbin-Watson: 1.917
Prob(Omnibus): 0.000 Jarque-Bera (JB): 459.992
Skew: -0.660 Prob(JB): 1.30e-100
Kurtosis: 4.681 Cond. No. 7.61e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# initial list of columns
cols = x_train3.columns.tolist()
# setting an initial max p-value
max_p_value = 1
# Loop to check for p-values of the variables and drop the column with the highest p-value.
while len(cols) > 0:
# defining the train set
x_train_aux = x_train3[cols]
# fitting the model
model = sm.OLS(y_train, x_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols # variables with p-values lesser than 0.05
print(selected_features)
['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'battery', 'release_year', 'normalized_new_price', 'weight_log', 'brand_name_Infinix', 'brand_name_Karbonn', 'brand_name_Lenovo', 'brand_name_Nokia', 'brand_name_Sony', 'brand_name_Xiaomi', 'os_Others', '4g_yes']
x_train7 = x_train3[selected_features]
x_test7 = x_test3[selected_features]
olsmode3rd = sm.OLS(y_train, x_train7)
olsres3 = olsmode3rd.fit()
print(olsres3.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.844
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 866.5
Date: Sat, 11 May 2024 Prob (F-statistic): 0.00
Time: 00:51:40 Log-Likelihood: 117.57
No. Observations: 2417 AIC: -203.1
Df Residuals: 2401 BIC: -110.5
Df Model: 15
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -37.5660 7.085 -5.302 0.000 -51.459 -23.673
screen_size 0.0219 0.003 6.929 0.000 0.016 0.028
main_camera_mp 0.0212 0.001 15.319 0.000 0.019 0.024
selfie_camera_mp 0.0149 0.001 14.571 0.000 0.013 0.017
battery -2.037e-05 7.1e-06 -2.870 0.004 -3.43e-05 -6.45e-06
release_year 0.0186 0.004 5.312 0.000 0.012 0.025
normalized_new_price 0.4340 0.010 42.818 0.000 0.414 0.454
weight_log 0.3050 0.032 9.419 0.000 0.241 0.368
brand_name_Infinix -1.9949 0.159 -12.564 0.000 -2.306 -1.684
brand_name_Karbonn 0.1081 0.054 1.999 0.046 0.002 0.214
brand_name_Lenovo 0.0467 0.021 2.180 0.029 0.005 0.089
brand_name_Nokia 0.0659 0.030 2.187 0.029 0.007 0.125
brand_name_Sony -0.0598 0.030 -1.992 0.046 -0.119 -0.001
brand_name_Xiaomi 0.0877 0.025 3.453 0.001 0.038 0.138
os_Others -0.0662 0.029 -2.319 0.020 -0.122 -0.010
4g_yes 0.0469 0.015 3.185 0.001 0.018 0.076
==============================================================================
Omnibus: 250.882 Durbin-Watson: 1.918
Prob(Omnibus): 0.000 Jarque-Bera (JB): 487.120
Skew: -0.673 Prob(JB): 1.67e-106
Kurtosis: 4.740 Cond. No. 5.87e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.87e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observation
Based on the drop in R-squared for olsres3, I will use olsres2 as my final model and the features in x_train3 as my final set of predictor variables.
The adjusted R-squared of olsres3 was 84.4%, compared with 84.6% for olsres, so the variables we dropped had minimal effect on the model (a decrease of roughly 0.2 percentage points).
Our final model explains ~84.4% of the variance in the used price, which indicates a good fit.
Why do we do this test?
Linearity means that each predictor variable has a straight-line relationship with the dependent variable.
Assessing model assumptions is crucial because all the estimates, intervals, and hypothesis tests in regression analysis rely on the assumptions holding. If any assumption is violated, it must be addressed appropriately to ensure reliable results.
How to check linearity and independence?
Make a plot of fitted values vs residuals. If they don't follow any pattern, then we say the model is linear and residuals are independent. If they follow a pattern, the model is showing signs of non-linearity and residuals are not independent.
How to fix if this assumption is not followed?
We can try to transform the variables and make the relationships linear.
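This notebook already applies such a transformation with weight_log. A minimal sketch of the idea on hypothetical data (not the ReCell dataset): a log transform compresses a long right tail, which often makes the relationship with the target closer to linear.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed predictor (illustrative only)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"weight": rng.lognormal(mean=5, sigma=0.5, size=1000)})

# Log transform reduces the skew considerably
df_demo["weight_log"] = np.log(df_demo["weight"])
print(df_demo["weight"].skew().round(2), df_demo["weight_log"].skew().round(2))
```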
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsres2.fittedvalues # predicted values
df_pred["Residuals"] = olsres2.resid # residuals
df_pred.head()
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 3026 | 4.087488 | 3.857981 | 0.229507 |
| 1525 | 4.448399 | 4.658666 | -0.210266 |
| 1128 | 4.315353 | 4.326174 | -0.010821 |
| 3003 | 4.282068 | 4.222787 | 0.059281 |
| 2907 | 4.456438 | 4.462717 | -0.006279 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="darkblue", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
This scatter plot shows the residuals (errors) plotted against the fitted (predicted) values.
The residuals are scattered randomly around zero with no discernible pattern.
The assumptions of linearity and independence are satisfied.
Why do we do this test?
In linear regression, we assume that the error terms (residuals), the differences between the observed values and the values predicted by the model, follow a normal distribution. When the error terms are normally distributed, our statistical inferences (such as confidence intervals and hypothesis tests) are valid. If the error terms deviate significantly from normality, the reliability of the regression results can suffer.
Checking Normality:
A histogram of the residuals should look roughly bell-shaped. A normal probability (Q-Q) plot of the residuals provides another check: if the points fall close to a straight line, the residuals are approximately normal, while curvature or outliers indicate non-normality. A formal test such as Shapiro-Wilk can supplement these plots.
sns.histplot(data=df_pred, x="Residuals", kde=True, color= "darkgreen" )
plt.title("Normality of residuals")
plt.show()
The histogram is bell-shaped, suggesting the residuals are approximately normally distributed.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Q-Q plot of the residuals against a normal distribution
data = df_pred["Residuals"]
param = stats.norm.fit(data)  # estimate loc and scale from the residuals
# Create the probability plot
fig = plt.figure(figsize=[6, 6], dpi=100)
ax = fig.add_subplot(111)
res = stats.probplot(data, dist="norm", sparams=param, plot=plt, fit=True)
# Customize the markers
ax.get_lines()[0].set_marker("*")  # marker style
ax.get_lines()[0].set_markerfacecolor("tab:red")  # marker color
ax.get_lines()[0].set_markersize(10.0)  # marker size
# Customize other plot features
ax.set_title("")  # remove the default title
ax.set_xlabel("Quantiles", fontsize=20, fontweight="bold")
ax.set_ylabel("Ordered Values", fontsize=20, fontweight="bold")
plt.show()
The points fall close to a straight line for the most part, deviating only at the tails.
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9687750935554504, pvalue=1.644691178803239e-22)
Since the p-value < 0.05, the Shapiro-Wilk test rejects normality of the residuals.
However, with ~2,400 observations the test has power to flag even small deviations from normality, and both the histogram and the Q-Q plot show a distribution close to normal.
We can therefore accept the distribution as approximately normal and treat the assumption as satisfied.
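For comparison, a quick sketch of the Shapiro-Wilk test on a truly normal sample of the same order of size as our residuals (hypothetical, seeded data). The statistic sits very close to 1 for normal data, which puts our observed 0.969 in context:

```python
import numpy as np
from scipy import stats

# Exactly normal data, similar n to our ~2,400 residuals
rng = np.random.default_rng(1)
sample = rng.normal(size=2400)

stat, pvalue = stats.shapiro(sample)
print(round(stat, 3))  # near 1 for normal data
```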
Homoscedasticity:
Homoscedasticity refers to the condition where the dispersion of error terms (or residuals) remains consistent across the full range of values of the independent variable(s).
When residuals exhibit homoscedasticity, statistical inferences are more reliable and valid.
Heteroscedasticity:
Heteroscedasticity occurs when the dispersion of error terms does not maintain consistency across all levels of the independent variable(s).
Heteroscedasticity can lead to unreliable test statistics, standard errors, and hypothesis tests.
Checking for Homoscedasticity:
The residual versus fitted values plot provides a visual way to assess homoscedasticity. Under heteroscedasticity, the spread of the residuals changes across the fitted values, often forming a funnel (cone) shape rather than a band of constant width.
If the p-value from the Goldfeld-Quandt test is greater than 0.05, we fail to reject the null hypothesis of homoscedasticity. In other words, we do not have sufficient evidence to conclude that the variance of the errors (residuals) differs across different subsets of the data.
# goldfeldquandt test for homoscedasticity
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)
[('F statistic', 1.069816803115121), ('p-value', 0.12503805606566432)]
Since p-value > 0.05, we can say that the residuals are homoscedastic. The assumption is satisfied.
olsmodel_final_featured = sm.OLS(y_train, x_train7).fit() # Running the final model
print(olsmodel_final_featured.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.844
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 866.5
Date: Sat, 11 May 2024 Prob (F-statistic): 0.00
Time: 00:51:42 Log-Likelihood: 117.57
No. Observations: 2417 AIC: -203.1
Df Residuals: 2401 BIC: -110.5
Df Model: 15
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -37.5660 7.085 -5.302 0.000 -51.459 -23.673
screen_size 0.0219 0.003 6.929 0.000 0.016 0.028
main_camera_mp 0.0212 0.001 15.319 0.000 0.019 0.024
selfie_camera_mp 0.0149 0.001 14.571 0.000 0.013 0.017
battery -2.037e-05 7.1e-06 -2.870 0.004 -3.43e-05 -6.45e-06
release_year 0.0186 0.004 5.312 0.000 0.012 0.025
normalized_new_price 0.4340 0.010 42.818 0.000 0.414 0.454
weight_log 0.3050 0.032 9.419 0.000 0.241 0.368
brand_name_Infinix -1.9949 0.159 -12.564 0.000 -2.306 -1.684
brand_name_Karbonn 0.1081 0.054 1.999 0.046 0.002 0.214
brand_name_Lenovo 0.0467 0.021 2.180 0.029 0.005 0.089
brand_name_Nokia 0.0659 0.030 2.187 0.029 0.007 0.125
brand_name_Sony -0.0598 0.030 -1.992 0.046 -0.119 -0.001
brand_name_Xiaomi 0.0877 0.025 3.453 0.001 0.038 0.138
os_Others -0.0662 0.029 -2.319 0.020 -0.122 -0.010
4g_yes 0.0469 0.015 3.185 0.001 0.018 0.076
==============================================================================
Omnibus: 250.882 Durbin-Watson: 1.918
Prob(Omnibus): 0.000 Jarque-Bera (JB): 487.120
Skew: -0.673 Prob(JB): 1.67e-106
Kurtosis: 4.740 Cond. No. 5.87e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.87e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
olsmodel_final = sm.OLS(y_train, x_train3).fit() # Running the final model
print(olsmodel_final.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.846
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 288.9
Date: Sat, 11 May 2024 Prob (F-statistic): 0.00
Time: 00:51:42 Log-Likelihood: 130.59
No. Observations: 2417 AIC: -169.2
Df Residuals: 2371 BIC: 97.18
Df Model: 45
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -43.0446 9.071 -4.746 0.000 -60.832 -25.257
screen_size 0.0211 0.003 6.536 0.000 0.015 0.027
main_camera_mp 0.0208 0.001 13.927 0.000 0.018 0.024
selfie_camera_mp 0.0145 0.001 13.297 0.000 0.012 0.017
int_memory 9.314e-05 6.91e-05 1.347 0.178 -4.24e-05 0.000
battery -2.068e-05 7.23e-06 -2.862 0.004 -3.49e-05 -6.51e-06
release_year 0.0213 0.004 4.744 0.000 0.012 0.030
days_used 4.255e-05 3.07e-05 1.386 0.166 -1.76e-05 0.000
normalized_new_price 0.4371 0.012 36.678 0.000 0.414 0.460
weight_log 0.3177 0.033 9.558 0.000 0.253 0.383
brand_name_Alcatel 0.0243 0.028 0.856 0.392 -0.031 0.080
brand_name_Apple -0.0462 0.045 -1.033 0.302 -0.134 0.042
brand_name_Asus 0.0222 0.029 0.774 0.439 -0.034 0.078
brand_name_BlackBerry -0.0535 0.058 -0.922 0.357 -0.167 0.060
brand_name_Celkon -0.0725 0.053 -1.358 0.175 -0.177 0.032
brand_name_Coolpad 0.0349 0.062 0.566 0.571 -0.086 0.156
brand_name_Gionee 0.0553 0.043 1.285 0.199 -0.029 0.140
brand_name_Google -0.0085 0.075 -0.113 0.910 -0.156 0.139
brand_name_HTC 0.0010 0.029 0.035 0.972 -0.055 0.057
brand_name_Honor 0.0459 0.030 1.511 0.131 -0.014 0.105
brand_name_Huawei 0.0135 0.022 0.607 0.544 -0.030 0.057
brand_name_Infinix -1.9342 0.169 -11.429 0.000 -2.266 -1.602
brand_name_Karbonn 0.1181 0.055 2.135 0.033 0.010 0.227
brand_name_LG 0.0013 0.024 0.056 0.955 -0.045 0.048
brand_name_Lava 0.0456 0.049 0.929 0.353 -0.051 0.142
brand_name_Lenovo 0.0551 0.024 2.290 0.022 0.008 0.102
brand_name_Meizu 0.0073 0.040 0.182 0.856 -0.072 0.086
brand_name_Micromax -0.0144 0.029 -0.499 0.618 -0.071 0.042
brand_name_Microsoft 0.0961 0.079 1.222 0.222 -0.058 0.250
brand_name_Motorola -0.0035 0.031 -0.115 0.908 -0.064 0.057
brand_name_Nokia 0.0698 0.033 2.089 0.037 0.004 0.135
brand_name_OnePlus 0.1014 0.067 1.517 0.129 -0.030 0.233
brand_name_Oppo 0.0277 0.028 0.990 0.322 -0.027 0.083
brand_name_Panasonic 0.0787 0.040 1.951 0.051 -0.000 0.158
brand_name_Realme 0.0275 0.047 0.580 0.562 -0.066 0.121
brand_name_Samsung -0.0123 0.020 -0.620 0.535 -0.051 0.027
brand_name_Sony -0.0504 0.032 -1.560 0.119 -0.114 0.013
brand_name_Spice -0.0020 0.050 -0.040 0.968 -0.101 0.097
brand_name_Vivo -0.0026 0.029 -0.088 0.930 -0.060 0.055
brand_name_XOLO 0.0490 0.040 1.237 0.216 -0.029 0.127
brand_name_Xiaomi 0.0972 0.029 3.409 0.001 0.041 0.153
brand_name_ZTE 0.0097 0.027 0.357 0.721 -0.044 0.063
os_Others -0.0451 0.030 -1.482 0.139 -0.105 0.015
os_Windows 0.0003 0.045 0.007 0.994 -0.088 0.088
4g_yes 0.0492 0.016 3.130 0.002 0.018 0.080
5g_yes -0.0099 0.028 -0.356 0.722 -0.064 0.045
==============================================================================
Omnibus: 242.033 Durbin-Watson: 1.917
Prob(Omnibus): 0.000 Jarque-Bera (JB): 459.992
Skew: -0.660 Prob(JB): 1.30e-100
Kurtosis: 4.681 Cond. No. 7.61e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# Let us write the equation of linear regression
Equation = "Used Phone Price ="
print(Equation, end=" ")
for i in range(len(x_train7.columns)):
if i == 0:
print(np.round(olsmodel_final_featured.params[i], 4), "+", end=" ")
elif i != len(x_train7.columns) - 1:
print(
"(",
np.round(olsmodel_final_featured.params[i], 4),
")*(",
x_train7.columns[i],
")",
"+",
end=" ",
)
else:
print("(", np.round(olsmodel_final_featured.params[i], 4), ")*(", x_train7.columns[i], ")")
Used Phone Price = -37.566 + ( 0.0219 )*( screen_size ) + ( 0.0212 )*( main_camera_mp ) + ( 0.0149 )*( selfie_camera_mp ) + ( -0.0 )*( battery ) + ( 0.0186 )*( release_year ) + ( 0.434 )*( normalized_new_price ) + ( 0.305 )*( weight_log ) + ( -1.9949 )*( brand_name_Infinix ) + ( 0.1081 )*( brand_name_Karbonn ) + ( 0.0467 )*( brand_name_Lenovo ) + ( 0.0659 )*( brand_name_Nokia ) + ( -0.0598 )*( brand_name_Sony ) + ( 0.0877 )*( brand_name_Xiaomi ) + ( -0.0662 )*( os_Others ) + ( 0.0469 )*( 4g_yes )
# Let us write the equation of linear regression
Equation = "Used Phone Price ="
print(Equation, end=" ")
for i in range(len(x_train3.columns)):
if i == 0:
print(np.round(olsmodel_final.params[i], 4), "+", end=" ")
elif i != len(x_train3.columns) - 1:
print(
"(",
np.round(olsmodel_final.params[i], 4),
")*(",
x_train3.columns[i],
")",
"+",
end=" ",
)
else:
print("(", np.round(olsmodel_final.params[i], 4), ")*(", x_train3.columns[i], ")")
Used Phone Price = -43.0446 + ( 0.0211 )*( screen_size ) + ( 0.0208 )*( main_camera_mp ) + ( 0.0145 )*( selfie_camera_mp ) + ( 0.0001 )*( int_memory ) + ( -0.0 )*( battery ) + ( 0.0213 )*( release_year ) + ( 0.0 )*( days_used ) + ( 0.4371 )*( normalized_new_price ) + ( 0.3177 )*( weight_log ) + ( 0.0243 )*( brand_name_Alcatel ) + ( -0.0462 )*( brand_name_Apple ) + ( 0.0222 )*( brand_name_Asus ) + ( -0.0535 )*( brand_name_BlackBerry ) + ( -0.0725 )*( brand_name_Celkon ) + ( 0.0349 )*( brand_name_Coolpad ) + ( 0.0553 )*( brand_name_Gionee ) + ( -0.0085 )*( brand_name_Google ) + ( 0.001 )*( brand_name_HTC ) + ( 0.0459 )*( brand_name_Honor ) + ( 0.0135 )*( brand_name_Huawei ) + ( -1.9342 )*( brand_name_Infinix ) + ( 0.1181 )*( brand_name_Karbonn ) + ( 0.0013 )*( brand_name_LG ) + ( 0.0456 )*( brand_name_Lava ) + ( 0.0551 )*( brand_name_Lenovo ) + ( 0.0073 )*( brand_name_Meizu ) + ( -0.0144 )*( brand_name_Micromax ) + ( 0.0961 )*( brand_name_Microsoft ) + ( -0.0035 )*( brand_name_Motorola ) + ( 0.0698 )*( brand_name_Nokia ) + ( 0.1014 )*( brand_name_OnePlus ) + ( 0.0277 )*( brand_name_Oppo ) + ( 0.0787 )*( brand_name_Panasonic ) + ( 0.0275 )*( brand_name_Realme ) + ( -0.0123 )*( brand_name_Samsung ) + ( -0.0504 )*( brand_name_Sony ) + ( -0.002 )*( brand_name_Spice ) + ( -0.0026 )*( brand_name_Vivo ) + ( 0.049 )*( brand_name_XOLO ) + ( 0.0972 )*( brand_name_Xiaomi ) + ( 0.0097 )*( brand_name_ZTE ) + ( -0.0451 )*( os_Others ) + ( 0.0003 )*( os_Windows ) + ( 0.0492 )*( 4g_yes ) + ( -0.0099 )*( 5g_yes )
Predict Used Phone Price
We will utilize the metric functions provided by scikit-learn (sklearn) to calculate the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2) scores.
We will create a function to compute the Mean Absolute Percentage Error (MAPE) and the adjusted R-squared.
The mean absolute percentage error (MAPE) quantifies the accuracy of predictions as a percentage. It is the average of the absolute errors between predicted and actual values, each divided by the actual value. MAPE is particularly useful when the data contain no extreme values and none of the actual values are zero.
We will create a function that consolidates all the above metrics into a single output.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# predictions on the test set
pred = olsmodel_final.predict(x_test3)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 1995 | 4.566741 | 4.364696 |
| 2341 | 3.696103 | 3.969256 |
| 1913 | 3.592093 | 3.602651 |
| 688 | 4.306495 | 4.106764 |
| 650 | 4.522115 | 5.129626 |
| 2291 | 4.259294 | 4.390414 |
| 40 | 4.997685 | 5.391301 |
| 1884 | 3.875359 | 4.023592 |
| 2538 | 4.206631 | 4.003159 |
| 45 | 5.380450 | 5.275084 |
# predictions on the test set
pred = olsmodel_final_featured.predict(x_test7)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 1995 | 4.566741 | 4.371731 |
| 2341 | 3.696103 | 3.971332 |
| 1913 | 3.592093 | 3.636821 |
| 688 | 4.306495 | 4.124797 |
| 650 | 4.522115 | 5.176730 |
| 2291 | 4.259294 | 4.399474 |
| 40 | 4.997685 | 5.384732 |
| 1884 | 3.875359 | 4.040823 |
| 2538 | 4.206631 | 4.029145 |
| 45 | 5.380450 | 5.274014 |
Our model is working as expected: the actual and predicted values are very close.
df3 = df_pred_test.sample(25, random_state=1)
df3.plot(kind="bar", figsize=(15, 7))
plt.show()
x_train7.columns # checking the columns of final train data
Index(['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'battery',
'release_year', 'normalized_new_price', 'weight_log',
'brand_name_Infinix', 'brand_name_Karbonn', 'brand_name_Lenovo',
'brand_name_Nokia', 'brand_name_Sony', 'brand_name_Xiaomi', 'os_Others',
'4g_yes'],
dtype='object')
x_test7.columns # checking the columns of final test data
Index(['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'battery',
'release_year', 'normalized_new_price', 'weight_log',
'brand_name_Infinix', 'brand_name_Karbonn', 'brand_name_Lenovo',
'brand_name_Nokia', 'brand_name_Sony', 'brand_name_Xiaomi', 'os_Others',
'4g_yes'],
dtype='object')
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(olsmodel_final, x_train3, y_train)
olsmodel_final_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.229244 | 0.179283 | 0.845748 | 0.842754 | 4.304175 |
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_final_featured_train_perf = model_performance_regression(olsmodel_final_featured, x_train7, y_train)
olsmodel_final_featured_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.230483 | 0.180064 | 0.844077 | 0.843037 | 4.321079 |
# checking model performance on test set (seen 30% data)
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel_final, x_test3, y_test)
olsmodel_final_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.237867 | 0.184334 | 0.843127 | 0.835838 | 4.493195 |
# checking model performance on test set (seen 30% data)
print("Test Performance\n")
olsmodel_final_featured_test_perf = model_performance_regression(olsmodel_final_featured, x_test7, y_test)
olsmodel_final_featured_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.236896 | 0.183564 | 0.844405 | 0.841964 | 4.473149 |
Observations
Based on olsmodel_final_featured:
The model is able to explain ~84.4% of the variation in the data, which is very good.
The train and test RMSE and MAE are low and comparable, so the model is not overfitting; it is performing well and making accurate predictions.
The MAPE on the test set suggests the model's predictions are off by about 4.47% of the normalized used price on average.
We can confidently state that the OLS model olsmodel_final_featured is suitable for both prediction and inference purposes.
# training performance comparison
models_train_comp_df = pd.concat(
[olsmodel1st_train_perf.T, olsmodel_final_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.229178 | 0.229244 |
| MAE | 0.179158 | 0.179283 |
| R-squared | 0.845837 | 0.845748 |
| Adj. R-squared | 0.842713 | 0.842754 |
| MAPE | 4.301420 | 4.304175 |
# training performance comparison
models_train_comp_df = pd.concat(
[olsmodel1st_train_perf.T, olsmodel_final_featured_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.229178 | 0.230483 |
| MAE | 0.179158 | 0.180064 |
| R-squared | 0.845837 | 0.844077 |
| Adj. R-squared | 0.842713 | 0.843037 |
| MAPE | 4.301420 | 4.321079 |
# test performance comparison
models_test_comp_df = pd.concat(
[olsmodel1st_test_perf.T, olsmodel_final_test_perf.T],
axis=1,
)
models_test_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.237835 | 0.237867 |
| MAE | 0.184359 | 0.184334 |
| R-squared | 0.843169 | 0.843127 |
| Adj. R-squared | 0.835550 | 0.835838 |
| MAPE | 4.493813 | 4.493195 |
There are no significant differences between the two models.
Interpretation of R-squared
The R-squared value tells us that our model can explain 84.4% of the variance in the training set.
Interpretation of Coefficients
The following factors are influencing the price of a used phone based on our model: screen_size, main_camera_mp, selfie_camera_mp, battery, release_year, normalized_new_price, weight_log, brand_name_Infinix, brand_name_Karbonn, brand_name_Lenovo, brand_name_Nokia, brand_name_Sony, brand_name_Xiaomi, os_Others, and 4g_yes.
The normalized_new_price is statistically significant in determining the used price: a one-unit increase in the normalized new price is associated with a 0.4340-unit increase in the normalized used price, holding the other predictors constant.
battery, os_Others, brand_name_Infinix, and brand_name_Sony have a negative impact on the price of a used phone.
After normalized_new_price: main_camera_mp, selfie_camera_mp, brand_name_Infinix, and weight_log have the biggest impact on the price of a used phone.
Based on the heat map produced on the original data I would have assumed screen_size, selfie_camera_mp, battery, and release_year would have more of an impact than they do on the price of a used phone.
Interpretation of p-values (P > |t|)
For each predictor variable there is a null hypothesis and an alternative hypothesis.
Null hypothesis: the predictor variable is not significant. Alternative hypothesis: the predictor variable is significant.
No variable in the final model has a p-value > 0.05, so every predictor is significant at the 5% level.
Linear Regression Assumptions
All of the assumptions for linear regression were met for the model
Recommendation
Although the olsmodel_final model shows an R-squared of 84.6%, we recommend the olsmodel_final_featured model, whose R-squared is 84.4%. The featured model gives slightly better predicted-versus-actual performance on the test set than olsmodel_final, with far fewer predictors.
The olsmodel_final_featured model can be used to accurately predict the price of the used phone.
Investment in the brands with negative coefficients should be reduced, as they do not hold their used/resale price well.
Phones with higher new prices should be a bigger focus, as they command higher used/resale prices.
%%shell
jupyter nbconvert --to html //'/content/drive/MyDrive/Python_Course/Project_3/SLF_Project_LearnerNotebook_FullCode.ipynb'
[NbConvertApp] Converting notebook ///content/drive/MyDrive/Python_Course/Project_3/SLF_Project_LearnerNotebook_FullCode.ipynb to html [NbConvertApp] Writing 5811889 bytes to /content/drive/MyDrive/Python_Course/Project_3/SLF_Project_LearnerNotebook_FullCode.html